Background

The file ‘WeatherDownloads_202005_v002.Rmd’ contains code for downloading and processing historical weather data from the METAR archives hosted by Iowa State University.

Data have been downloaded and processed for several stations (airports) and years, with .rds files saved in “./RInputFiles/ProcessedMETAR”.

This module will perform exploratory data analysis on the processed weather files.

Data Availability

Each processed data file contains one year of hourly weather data for one station. Files are saved as ‘./RInputFiles/ProcessedMETAR/metar_kxxx_yyyy.rds’, where xxx is the three-letter airport code and yyyy is the four-digit year.
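
The naming convention can be made concrete with a small path helper. This is a hypothetical sketch (metarPath is not part of the module), shown only to illustrate the convention:

```r
# Hypothetical helper: build the .rds path for a station/year pair
# following the naming convention described above.
metarPath <- function(station, year, dir = "./RInputFiles/ProcessedMETAR") {
    file.path(dir, sprintf("metar_%s_%04d.rds", tolower(station), year))
}

metarPath("KDTW", 2016)
# "./RInputFiles/ProcessedMETAR/metar_kdtw_2016.rds"
```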

Each file contains the following variables:

  • METAR (chr) - the extracted portion of the METAR based on a regex string
  • WindDir (chr) - the prevailing wind direction in degrees, stored as a character since ‘VRB’ means variable
  • WindSpeed (int) - the prevailing wind speed in knots
  • WindGust (dbl) - the wind gust speed in knots (NA if there is no recorded wind gust at that hour)
  • Dummy (chr) - artifact, always a blank space
  • Visibility (dbl) - surface visibility in statute miles
  • TempC (int) - temperature in degrees Celsius
  • DewC (int) - dew point in degrees Celsius
  • Altimeter (int) - altimeter in inches of mercury
  • SLP (int) - the raw sea-level-pressure reading from the METAR
  • FahrC (chr) - the raw temperature string pulled from the METAR (Tttttdddd), where tttt is the temperature and dddd is the dew point, each recorded in tenths of a degree Celsius with a leading sign digit
  • dtime (dttm) - the date-time associated with the observation
  • origMETAR (chr) - the full METAR associated with the observation
  • TempF (dbl) - the temperature in degrees Fahrenheit, converted from FahrC
  • DewF (dbl) - the dew point in degrees Fahrenheit, converted from FahrC
  • modSLP (dbl) - sea-level pressure in hPa, adjusted to reflect that the raw SLP code runs 0-1000 while the underlying pressures run roughly 950-1050
  • nSKC (int) - number of times ‘SKC’ (human-confirmed cloud-free) is recorded in the observation (should be 0 or 1)
  • nCLR (int) - number of times ‘CLR’ (automated-sensor cloud-free) is recorded in the observation (should be 0 or 1, and an observation should never have both nSKC>0 and nCLR>0)
  • cloudn (chr) - the nth cloud layer recorded in the METAR (layers begin with FEW, SCT, BKN, OVC or VV)
  • cTypen (chr) - the cloud type of the nth cloud layer (FEW, BKN, SCT, OVC, or VV)
  • cLeveln (dbl) - the cloud height in feet of the nth cloud layer
  • wType (fct) - highest level of obscuration recorded in the METAR (VV > OVC > BKN > SCT > FEW > CLR/SKC)
  • year (dbl) - year of the observation
  • monthint (dbl) - month of the observation as a number (e.g., 6=June)
  • month (fct) - month of the observation as a three-character abbreviation, saved as a factor (e.g., Jun=June)
  • day (int) - day of the month of the observation
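
Two of the raw encodings above can be illustrated with small decoders. These are hypothetical sketches (decodeTGroup and adjustSLP are not module functions) that assume the standard METAR conventions: the Tttttdddd group stores temperature and dew point in tenths of a degree Celsius with a leading sign digit (1 = negative), and the SLP group stores the trailing digits of the pressure in tenths of hPa:

```r
# Hypothetical decoder for the Tttttdddd group: each half is a sign digit
# (0 positive, 1 negative) followed by tenths of a degree Celsius.
decodeTGroup <- function(s) {
    tt <- as.numeric(substr(s, 3, 5)) / 10 * ifelse(substr(s, 2, 2) == "1", -1, 1)
    dd <- as.numeric(substr(s, 7, 9)) / 10 * ifelse(substr(s, 6, 6) == "1", -1, 1)
    c(TempC = tt, DewC = dd, TempF = tt * 9/5 + 32, DewF = dd * 9/5 + 32)
}

# Hypothetical SLP adjustment: codes below 500 imply a 1000 hPa base,
# codes at or above 500 imply a 900 hPa base, giving the 950-1050 range.
adjustSLP <- function(slp) ifelse(slp < 500, 1000 + slp / 10, 900 + slp / 10)

decodeTGroup("T01280083")  # TempC 12.8, DewC 8.3, TempF 55.04, DewF 46.94
adjustSLP(c(134, 987))     # 1013.4  998.7
```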

Base Functions Available

There are several functions available for analysis:

  • plotCountsByMetric() - bar plots for counts by variable

  • plotNumCor() - plot two numeric variables against each other

  • plotFactorNumeric() - boxplot a numeric variable against a factor variable

  • corMETAR() - correlations between METAR variables

  • lmMETAR() - linear regression modeling for METAR variables

  • basicWindPlots() - plot wind speed and direction

  • getWindDirGroup() - convert wind direction to a compass grouping (e.g., N for directions 350, 360, and 010)

  • consolidatePlotWind() - show frequency plots of wind direction by city and month

The tidyverse library is loaded and the 2016 Detroit data are read in to show examples of the functions:

library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.4
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ---------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
kdtw_2016 <- readRDS("./RInputFiles/ProcessedMETAR/metar_kdtw_2016.rds")

A variable mapping is created to provide more readable variable names:

varMapper <- c(WindDir="Wind Direction (degrees)", 
               predomDir="General Prevailing Wind Direction",
               WindSpeed="Wind Speed (kts)",
               WindSpeed5="Wind Speed (kts), rounded to nearest 5 knots", 
               Visibility="Visibility (SM)", 
               TempC="Temperature (C)", 
               DewC="Dew Point (C)", 
               Altimeter="Altimeter (inches Hg)",
               Altimeter10="Altimeter (inches Hg), rounded to nearest 0.1 inHg", 
               modSLP="Sea-Level Pressure (hPa)", 
               TempF="Temperature (F)",
               DewF="Dew Point (F)", 
               TempF5="Temperature (F), rounded to nearest 5 degrees",
               DewF5="Dew Point (F), rounded to nearest 5 degrees", 
               cType1="First Cloud Layer Type", 
               cLevel1="First Cloud Layer Height (ft)",
               month="Month", 
               year="Year",
               wType="Greatest Sky Obscuration", 
               day="Day of Month"
               )

The function plotCountsByMetric() produces bar plots for counts by variable:

# Helper function for generating plots by key variables
plotCountsByMetric <- function(df, 
                               mets, 
                               title="", 
                               rotateOn=20, 
                               dropNA=TRUE, 
                               diagnose=FALSE,
                               mapper=varMapper,
                               facetOn=NULL, 
                               showCentral=FALSE
                               ) {
    
    # Function arguments
    # df: dataframe or tibble containing raw data
    # mets: character vector of variables for plotting counts
    # title: character vector for plot title
    # rotateOn: integer, x-axis labels will be rotated by 90 degrees if # categories >= rotateOn
    # dropNA: boolean for whether to drop all NA prior to plotting (recommended for avoiding warnings)
    # diagnose: boolean for whether to note in the log the number of NA observations dropped
    # mapper: named list containing mapping from variable name to well-formatted name for titles and axes
    # facetOn: a facetting variable for the supplied df (NULL for no faceting)
    # showCentral: boolean for whether to show the central tendency over-plotted on the main data
    
    # Function usage
    # 1.  By default, the function plots overall counts by metric for a given input
    # 2.  If facetOn is passed as a non-NULL, then the data in #1 will be facetted by facetOn
    # 3.  If showCentral=TRUE, then the overall mean will be plotted as a point on the main plot (only makes sense if facetOn has been selected)
    
    
    # Plot of counts by key metric
    for (x in mets) {
        # If a facetting variable is provided, need to include this in the group_by
        useVars <- x
        if (!is.null(facetOn)) { useVars <- c(facetOn, useVars) }
        dat <- df %>%
            group_by_at(vars(all_of(useVars))) %>%
            summarize(n=n())
        
        if (dropNA) {
            nOrig <- nrow(dat)
            sumOrig <- sum(dat$n)
            dat <- dat %>%
                filter_all(all_vars(!is.na(.)))
            if (diagnose & (nOrig > nrow(dat))) { 
                cat("\nDropping", 
                    nOrig-nrow(dat), 
                    "rows with", 
                    sumOrig-sum(dat$n), 
                    "observations due to NA\n"
                    )
            }
        }
        
        # Create the main plot
        p <- dat %>%
            ggplot(aes_string(x=x, y="n")) + 
            geom_col() + 
            labs(title=title,
                 subtitle=paste0("Counts By: ", mapper[x]), 
                 x=paste0(x, " - ", mapper[x]),
                 y="Count"
                 )
        # If the rotateOn criteria is exceeded, rotate the x-axis by 90 degrees
        if (nrow(dat) >= rotateOn) {
            p <- p + theme(axis.text.x=element_text(angle=90))
        }
        # If facetting has been requested, facet by the desired variable
        if (!is.null(facetOn)) {
            p <- p + facet_wrap(as.formula(paste("~", facetOn)))
        }
        # If showCentral=TRUE, add a dot plot for the overall average
        if (showCentral) {
            # Get the median number of observations by facet, or the total if facetOn=NULL
            if (is.null(facetOn)) {
                useN <- sum(dat$n)
            } else {
                useN <- dat %>%
                    group_by_at(vars(all_of(facetOn))) %>%
                    summarize(n=sum(n)) %>%
                    pull(n) %>%
                    median()
            }
            # Get the overall percentages by x
            centralData <- helperCountsByMetric(tbl=dat, ctVar=x, sumOn="n") %>%
                mutate(centralValue=nPct*useN)
            # Apply the median
            p <- p + geom_point(data=centralData, aes(y=centralValue), color="red", size=2)
        }
        # Print the plot
        print(p)
    }
}

# Example for Detroit 2016 - using WindDir, cType1, month, wType
plotCountsByMetric(kdtw_2016, 
                   mets=c("WindDir", "cType1", "month", "wType"), 
                   title="Detroit, MI (2016)"
                   )

The function plotNumCor() plots two numeric variables against one another:

# Create a function for plotting two variables against each other
plotNumCor <- function(met, 
                       var1, 
                       var2, 
                       title=NULL, 
                       subT="", 
                       dropNA=TRUE, 
                       diagnose=FALSE,
                       mapper=varMapper, 
                       facetOn=NULL, 
                       showCentral=FALSE
                       ) {
    
    # Function arguments
    # met: dataframe or tibble containing raw data
    # var1: character vector of variable to be used for the x-axis
    # var2: character vector of variable to be used for the y-axis
    # title: character vector for plot title
    # subT: character vector for plot subtitle
    # dropNA: boolean for whether to drop all NA prior to plotting (recommended for avoiding warnings)
    # diagnose: boolean for whether to note in the log the number of NA observations dropped
    # mapper: named list containing mapping from variable name to well-formatted name for titles and axes
    # facetOn: a facetting variable for the supplied met (NULL for no faceting)
    # showCentral: boolean for whether to show the central tendency over-plotted on the main data
    
    # Function usage
    # 1.  By default, the function plots overall counts by the provided x/y metrics, with each point sized based on the number of observations, and with an lm smooth overlaid
    # 2.  If facetOn is passed as a non-NULL, then the data in #1 will be facetted by facetOn
    # 3.  If showCentral=TRUE, then the lm smooth that best fits the overall data will be plotted (only makes sense if facetOn has been selected)
    
    # Create the title if not passed
    if (is.null(title)) { 
        title <- paste0("Hourly Observations of ", mapper[var1], " and ", mapper[var2]) 
    }

    # If a facetting variable is provided, need to include this in the group_by
    useVars <- c(var1, var2)
    if (!is.null(facetOn)) { useVars <- c(facetOn, useVars) }
        
    # Pull the counts by useVars
    dat <- met %>%
        group_by_at(vars(all_of(useVars))) %>%
        summarize(n=n()) 
    
    # If NA requested to be excluded, remove anything with NA
    if (dropNA) {
        nOrig <- nrow(dat)
        sumOrig <- sum(dat$n)
        dat <- dat %>%
            filter_all(all_vars(!is.na(.)))
        if (diagnose) { 
            cat("\nDropping", 
                nOrig-nrow(dat), 
                "rows with", 
                sumOrig-sum(dat$n), 
                "observations due to NA\n"
                )
        }
    }
    
    p <- dat %>%
        ggplot(aes_string(x=var1, y=var2)) + 
        geom_point(alpha=0.5, aes_string(size="n")) + 
        geom_smooth(method="lm", aes_string(weight="n")) + 
        labs(x=paste0(mapper[var1], " - ", var1), 
             y=paste0(mapper[var2], " - ", var2), 
             title=title, 
             subtitle=subT
             )
    
    # If facetting has been requested, facet by the desired variable
    if (!is.null(facetOn)) {
        p <- p + facet_wrap(as.formula(paste("~", facetOn)))
    }
    # If showCentral=TRUE, add a dashed line for the overall data
    if (showCentral) {
        p <- p + helperNumCor(dat, xVar=var1, yVar=var2, sumOn="n")
    }
    
    print(p)
}

# Example for Detroit 2016 - using TempC and TempF
plotNumCor(kdtw_2016, var1="TempC", var2="TempF", subT="Detroit, MI (2016)", diagnose=TRUE)
## 
## Dropping 1 rows with 49 observations due to NA

# Example for Detroit 2016 - using TempC and DewC
plotNumCor(kdtw_2016, var1="TempC", var2="DewC", subT="Detroit, MI (2016)", diagnose=TRUE)
## 
## Dropping 1 rows with 49 observations due to NA

# Example for Detroit 2016 - using Altimeter and modSLP
plotNumCor(kdtw_2016, var1="Altimeter", var2="modSLP", subT="Detroit, MI (2016)", diagnose=TRUE)
## 
## Dropping 1 rows with 49 observations due to NA

The function plotFactorNumeric() creates box plots for a numeric variable against a factor variable:

# Updated function for plotting numeric by factor
plotFactorNumeric <- function(met, 
                              fctVar, 
                              numVar, 
                              title=NULL, 
                              subT="", 
                              diagnose=TRUE,
                              showXLabel=TRUE,
                              mapper=varMapper,
                              facetOn=NULL, 
                              showCentral=FALSE
                              ) {
    
    # Function arguments
    # met: dataframe or tibble containing raw data
    # fctVar: character vector of variable to be used for the x-axis (factor in the boxplot)
    # numVar: character vector of variable to be used for the y-axis (numeric in the boxplot)
    # title: character vector for plot title
    # subT: character vector for plot subtitle
    # diagnose: boolean for whether to note in the log the number of NA observations dropped
    # showXLabel: boolean for whether to include the x-label (e.g., set to FALSE if using 'month')
    # mapper: named list containing mapping from variable name to well-formatted name for titles and axes
    # facetOn: a facetting variable for the supplied met (NULL for no faceting)
    # showCentral: boolean for whether to show the central tendency over-plotted on the main data
    
    # Function usage
    # 1.  By default, the function creates the boxplot of numVar by fctVar
    # 2.  If facetOn is passed as a non-NULL, then the data in #1 will be facetted by facetOn
    # 3.  If showCentral=TRUE, then the overall median of numVar by fctVar will be plotted as a red dot

    
    # Create the title if not passed
    if (is.null(title)) { 
        title <- paste0("Hourly Observations of ", mapper[numVar], " by ", mapper[fctVar])
    }
    
    # Remove the NA variables
    nOrig <- nrow(met)
    dat <- met %>%
        filter(!is.na(get(fctVar)), !is.na(get(numVar)))
    if (diagnose) { cat("\nRemoving", nOrig-nrow(dat), "records due to NA\n") }
    
    # Create the base plot
    p <- dat %>%
        ggplot(aes_string(x=fctVar, y=numVar)) + 
        geom_boxplot(fill="lightblue") + 
        labs(title=title, 
             subtitle=subT, 
             x=ifelse(showXLabel, paste0(mapper[fctVar], " - ", fctVar), ""), 
             y=paste0(mapper[numVar], " - ", numVar)
             )
    
    # If facetting has been requested, facet by the desired variable
    if (!is.null(facetOn)) {
        p <- p + facet_wrap(as.formula(paste("~", facetOn)))
    }
    
    # If showCentral=TRUE, add a dot plot for the overall average
    if (showCentral) {
        centData <- helperFactorNumeric(dat, .f=median, byVar=fctVar, numVar=numVar)
        p <- p + geom_point(data=centData, aes(y=helpFN), size=2, color="red")
    }

    # Render the final plot
    print(p)
    
}

# Example for Detroit 2016 - using TempF and month
plotFactorNumeric(kdtw_2016, 
                  fctVar="month", 
                  numVar="TempF", 
                  subT="Detroit, MI (2016)", 
                  showXLabel=FALSE,
                  diagnose=TRUE
                  )
## 
## Removing 49 records due to NA

# Example for Detroit 2016 - using WindSpeed and wType
plotFactorNumeric(kdtw_2016, 
                  fctVar="wType", 
                  numVar="WindSpeed", 
                  subT="Detroit, MI (2016)", 
                  showXLabel=TRUE,
                  diagnose=TRUE
                  )
## 
## Removing 49 records due to NA

# Example for Detroit 2016 - using Visibility and wType
plotFactorNumeric(kdtw_2016, 
                  fctVar="wType", 
                  numVar="Visibility", 
                  subT="Detroit, MI (2016)", 
                  showXLabel=TRUE,
                  diagnose=TRUE
                  )
## 
## Removing 49 records due to NA

An issue previously observed, in which a visibility of 1/16SM was interpreted as 16 statute miles, has been corrected in the ‘WeatherDownloads_202005_v002’ file.
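
The class of bug involved is easy to reproduce: a naive numeric extraction of "1/16SM" picks up the 16. A hypothetical parser (parseVisibility is not the module's function) that handles whole, fractional, and mixed visibility groups might look like:

```r
# Hypothetical sketch: convert a METAR visibility group to statute miles,
# evaluating fractions such as 1/16SM rather than reading them as 16.
parseVisibility <- function(s) {
    s <- sub("SM$", "", s)
    parts <- strsplit(s, " ", fixed = TRUE)[[1]]
    sum(vapply(parts, function(p) {
        if (grepl("/", p, fixed = TRUE)) {
            f <- as.numeric(strsplit(p, "/", fixed = TRUE)[[1]])
            f[1] / f[2]
        } else {
            as.numeric(p)
        }
    }, numeric(1)))
}

parseVisibility("1/16SM")   # 0.0625
parseVisibility("2 1/2SM")  # 2.5
```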

The function corMETAR() calculates correlations among numeric variables in the METAR data:

# Function to calculate, display, and plot variable correlations
corMETAR <- function(met, numVars, subT="") {

    # Keep only complete cases and report on data kept
    dfUse <- met %>%
        select_at(vars(all_of(numVars))) %>%
        filter(complete.cases(.))
    
    nU <- nrow(dfUse)
    nM <- nrow(met)
    myPct <- round(100*nU/nM, 1)
    cat("\n *** Correlations use ", nU, " complete cases (", myPct, "% of ", nM, " total) ***\n", sep="")
    
    # Create the correlation matrix
    mtxCorr <- dfUse %>%
        cor()

    # Print the correlations
    mtxCorr %>%
        round(2) %>%
        print()

    # Display a heat map
    corrplot::corrplot(mtxCorr, 
                       method="color", 
                       title=paste0("Hourly Weather Correlations\n", subT), 
                       mar=c(0, 0, 2, 0)
                       )
}

# Example for Detroit, MI 2016
coreNum <- c("TempC", "TempF", "DewC", "DewF", 
             "Altimeter", "modSLP", "WindSpeed", "Visibility"
             )
corMETAR(kdtw_2016, numVars=coreNum, subT="Detroit, MI (2016) METAR")
## 
##  *** Correlations use 8769 complete cases (99.4% of 8818 total) ***
##            TempC TempF  DewC  DewF Altimeter modSLP WindSpeed Visibility
## TempC       1.00  1.00  0.92  0.92     -0.17  -0.24     -0.10       0.14
## TempF       1.00  1.00  0.92  0.92     -0.17  -0.24     -0.10       0.14
## DewC        0.92  0.92  1.00  1.00     -0.22  -0.28     -0.18      -0.07
## DewF        0.92  0.92  1.00  1.00     -0.22  -0.28     -0.18      -0.07
## Altimeter  -0.17 -0.17 -0.22 -0.22      1.00   1.00     -0.37       0.15
## modSLP     -0.24 -0.24 -0.28 -0.28      1.00   1.00     -0.35       0.14
## WindSpeed  -0.10 -0.10 -0.18 -0.18     -0.37  -0.35      1.00       0.10
## Visibility  0.14  0.14 -0.07 -0.07      0.15   0.14      0.10       1.00

The function lmMETAR() runs simple linear regression models on the METAR data:

# Function for linear regressions on METAR data
lmMETAR <- function(met, 
                    y, 
                    x, 
                    yName, 
                    subT=""
                    ) {
    
    # Convert to formula
    myChar <- paste0(y, " ~ ", x)
    cat("\n *** Regression call is:", myChar, "***\n")
    
    # Run regression
    regr <- lm(formula(myChar), data=met)
    
    # Summarize regression
    print(summary(regr))
    
    # Predict the new values
    pred <- predict(regr, newdata=met)
    
    # Plot the predictions
    p <- met %>%
        select_at(vars(all_of(y))) %>%
        mutate(pred=pred) %>%
        filter_all(all_vars(!is.na(.))) %>%
        group_by_at(vars(all_of(c(y, "pred")))) %>%
        summarize(n=n()) %>%
        ggplot(aes_string(x=y, y="pred")) + 
        geom_point(aes(size=n), alpha=0.25) + 
        geom_smooth(aes(weight=n), method="lm") + 
        labs(title=paste0("Predicted vs. Actual ", yName, " - ", x, " as Predictor"), 
             subtitle=subT, 
             x=paste0("Actual ", yName), 
             y=paste0("Predicted ", yName)
             )
    print(p)
    
}

# Examples for Detroit, MI 2016
lmMETAR(kdtw_2016, "modSLP", "Altimeter", yName="Sea Level Pressure", subT="Detroit, MI (2016)")
## 
##  *** Regression call is: modSLP ~ Altimeter ***
## 
## Call:
## lm(formula = formula(myChar), data = met)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.96361 -0.44022 -0.03758  0.40772  1.41448 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -16.44389    0.74723  -22.01   <2e-16 ***
## Altimeter    34.41514    0.02488 1383.17   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4961 on 8767 degrees of freedom
##   (49 observations deleted due to missingness)
## Multiple R-squared:  0.9954, Adjusted R-squared:  0.9954 
## F-statistic: 1.913e+06 on 1 and 8767 DF,  p-value: < 2.2e-16

lmMETAR(kdtw_2016, "modSLP", "Altimeter + TempF", yName="Sea Level Pressure", subT="Detroit, MI (2016)")
## 
##  *** Regression call is: modSLP ~ Altimeter + TempF ***
## 
## Call:
## lm(formula = formula(myChar), data = met)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.57802 -0.12674  0.00481  0.12820  0.65708 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.713e+00  2.803e-01  -13.25   <2e-16 ***
## Altimeter    3.403e+01  9.302e-03 3658.80   <2e-16 ***
## TempF       -2.351e-02  9.941e-05 -236.48   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1826 on 8766 degrees of freedom
##   (49 observations deleted due to missingness)
## Multiple R-squared:  0.9994, Adjusted R-squared:  0.9994 
## F-statistic: 7.086e+06 on 2 and 8766 DF,  p-value: < 2.2e-16

lmMETAR(kdtw_2016, "TempC", "DewC", yName="Temperature (C)", subT="Detroit, MI (2016)")
## 
##  *** Regression call is: TempC ~ DewC ***
## 
## Call:
## lm(formula = formula(myChar), data = met)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.174 -3.172 -1.165  2.822 17.828 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.17084    0.05373   114.9   <2e-16 ***
## DewC         0.99909    0.00470   212.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.463 on 8767 degrees of freedom
##   (49 observations deleted due to missingness)
## Multiple R-squared:  0.8375, Adjusted R-squared:  0.8375 
## F-statistic: 4.519e+04 on 1 and 8767 DF,  p-value: < 2.2e-16

The basicWindPlots() function creates plots for wind speed and direction:

# Generate basic wind plots
basicWindPlots <- function(met, 
                           dirVar="WindDir", 
                           spdVar="WindSpeed",
                           desc="", 
                           gran="", 
                           mapper=varMapper
                           ) {

    # Plot for the wind direction
    wDir <- met %>%
        ggplot(aes_string(x=dirVar)) + 
        geom_bar() + 
        labs(title=paste0(desc, " Wind Direction"), subtitle=gran, 
             y="# Hourly Observations", x=mapper[dirVar]
             ) + 
        theme(axis.text.x=element_text(angle=90))
    print(wDir)

    # Plot for the minimum, average, and maximum wind speed by wind direction
    # Wind direction 000 is reserved for 0 KT wind, while VRB is reserved for 3-6 KT variable winds
    wSpeedByDir <- met %>%
        filter(!is.na(get(dirVar))) %>%
        group_by_at(vars(all_of(dirVar))) %>%
        summarize(minWind=min(get(spdVar)), meanWind=mean(get(spdVar)), maxWind=max(get(spdVar))) %>%
        ggplot(aes_string(x=dirVar)) +
        geom_point(aes(y=meanWind), color="red", size=2) +
        geom_errorbar(aes(ymin=minWind, ymax=maxWind)) +
        labs(title=paste0(desc, " Wind Speed (Max, Mean, Min) By Wind Direction"), 
             subtitle=gran,
             y=mapper[spdVar], 
             x=mapper[dirVar]
             ) + 
        theme(axis.text.x=element_text(angle=90))
    print(wSpeedByDir)

    # Plot for the wind speed
    pctZero <- sum(pull(met, spdVar)==0, na.rm=TRUE) / nrow(met)
    wSpeed <- met %>%
        filter_at(vars(all_of(spdVar)), all_vars(!is.na(.))) %>%
        ggplot(aes_string(x=spdVar)) +
        geom_bar(aes(y=..count../sum(..count..))) +
        labs(title=paste0(round(100*pctZero), "% of wind speeds in ", desc, " measure 0 Knots"),
             subtitle=gran,
             y="% Hourly Observations", 
             x=mapper[spdVar]
             )
    print(wSpeed)
    
    # Polar plot for wind speed and wind direction
    wData <- met %>%
        filter_at(vars(all_of(dirVar)), all_vars(!is.na(.) & !(. %in% c("000", "VRB")))) %>%
        filter_at(vars(all_of(spdVar)), all_vars(!is.na(.))) %>%
        mutate_at(vars(all_of(dirVar)), as.numeric) %>%
        group_by_at(vars(all_of(c(dirVar, spdVar)))) %>%
        summarize(n=n())
        
    wPolarDirSpeed <- wData %>%
        ggplot(aes_string(x=spdVar, y=dirVar)) +
        geom_point(alpha=0.1, aes(size=n)) +
        coord_polar(theta="y") +
        labs(title=paste0(desc, " Direction vs. Wind Speed"), 
             subtitle=gran, 
             x=mapper[spdVar], 
             y=mapper[dirVar]
             ) +
        scale_y_continuous(limits=c(0, 360), breaks=c(0, 90, 180, 270, 360)) +
        scale_x_continuous(limits=c(0, 40), breaks=c(0, 5, 10, 15, 20, 25, 30, 35, 40)) +
        geom_point(aes(x=0, y=0), color="red", size=2)
    print(wPolarDirSpeed)

}

# Example for Detroit, MI 2016
basicWindPlots(kdtw_2016, desc="Detroit, MI (2016)", gran="KDTW METAR")

The getWindDirGroup() function maps wind direction to a compass category such as NNE. Because the METAR data are recorded in 10-degree increments, either 4 groupings (90 degrees each) or 12 groupings (30 degrees each) are preferred, so that each category covers the same number of underlying buckets:

# Extract the wind direction data from a processed METAR file
getWindDirGroup <- function(met, src) {
    
    # Use the fullMETAR data and extract WindDir, WindSpeed, month
    windPlotData <- met %>%
        select(WindDir, WindSpeed, month) %>%
        mutate(windDirGroup=factor(case_when(WindSpeed==0 ~ "No Wind", 
                                             WindDir=="VRB" ~ "Variable", 
                                             WindDir %in% c("350", "360", "010") ~ "N", 
                                             WindDir %in% c("020", "030", "040") ~ "NNE", 
                                             WindDir %in% c("050", "060", "070") ~ "ENE", 
                                             WindDir %in% c("080", "090", "100") ~ "E", 
                                             WindDir %in% c("110", "120", "130") ~ "ESE",
                                             WindDir %in% c("140", "150", "160") ~ "SSE", 
                                             WindDir %in% c("170", "180", "190") ~ "S", 
                                             WindDir %in% c("200", "210", "220") ~ "SSW",
                                             WindDir %in% c("230", "240", "250") ~ "WSW", 
                                             WindDir %in% c("260", "270", "280") ~ "W", 
                                             WindDir %in% c("290", "300", "310") ~ "WNW", 
                                             WindDir %in% c("320", "330", "340") ~ "NNW", 
                                             TRUE ~ "Error"
                                             ), 
                                   levels=c("No Wind", "Variable", "Error", 
                                            "N", "NNE", "ENE", 
                                            "E", "ESE", "SSE", 
                                            "S", "SSW", "WSW", 
                                            "W", "WNW", "NNW"
                                            )
                                   )
               )
    
    # Remove the errors and calculate percentages by month for the remainder
    processedWindData <- windPlotData %>%
        filter(windDirGroup != "Error") %>%
        group_by(month, windDirGroup) %>%
        summarize(n=n()) %>%
        ungroup() %>%
        group_by(month) %>%
        mutate(pct=n/sum(n)) %>%
        ungroup() %>%
        mutate(src=src)
    
    processedWindData

}
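
Since each compass group in the case_when above spans exactly three adjacent 10-degree buckets, the same lookup can be expressed arithmetically. This is a hypothetical equivalent (dirToGroup is not a module function) and, like the original, would still need the calm-wind and VRB special cases handled first:

```r
# Hypothetical arithmetic version of the 12-way grouping: shift by one
# bucket so 350/360/010 land together, then integer-divide by 3.
dirToGroup <- function(dir) {
    groups <- c("N", "NNE", "ENE", "E", "ESE", "SSE",
                "S", "SSW", "WSW", "W", "WNW", "NNW")
    idx <- ((as.numeric(dir) / 10 + 1) %% 36) %/% 3 + 1
    groups[idx]
}

dirToGroup(c("350", "360", "010", "040", "220"))
# "N" "N" "N" "NNE" "SSW"
```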

The function consolidatePlotWind() then calls getWindDirGroup() for any number of files:

# Consolidate and plot wind data
consolidatePlotWind <- function(files, names) {

    consFun <- function(x, y) { getWindDirGroup(met=x, src=y) }
    boundByRows <- map2_dfr(.x=files, .y=names, .f=consFun)

    # Show frequency by month for each city, faceted by wind direction
    p1 <- boundByRows %>%
        ggplot(aes(x=month, y=pct, color=src)) + 
        geom_line(aes(group=src)) + 
        facet_wrap(~windDirGroup) + 
        labs(title="Wind Frequency by Month", 
             x="Month", 
             y="Frequency of Wind Observations"
             ) +
        theme(axis.text.x=element_text(angle=90))
    print(p1)
    
    # Show frequency by wind direction for each city, faceted by month
    p2 <- boundByRows %>%
        ggplot(aes(x=windDirGroup, y=pct, color=src)) + 
        geom_line(aes(group=src)) + 
        facet_wrap(~month) + 
        labs(title="Wind Frequency by Wind Direction", 
             x="Wind Direction", 
             y="Frequency of Wind Observations"
             ) +
        theme(axis.text.x=element_text(angle=90))
    print(p2)
    
    boundByRows
    
}

# Load the Las Vegas data and New Orleans data for comparison
kmsy_2016 <- readRDS("./RInputFiles/ProcessedMETAR/metar_kmsy_2016.rds")
klas_2016 <- readRDS("./RInputFiles/ProcessedMETAR/metar_klas_2016.rds")

# Run wind by month comparisons for Detroit, Las Vegas, New Orleans
consolidatePlotWind(files=list(kdtw_2016, klas_2016, kmsy_2016), 
                    names=c("Detroit, MI (2016)", "Las Vegas, NV (2016)", "New Orleans, LA (2016)")
                    )

## # A tibble: 504 x 5
##    month windDirGroup     n     pct src               
##    <fct> <fct>        <int>   <dbl> <chr>             
##  1 Jan   No Wind         46 0.0600  Detroit, MI (2016)
##  2 Jan   Variable         3 0.00391 Detroit, MI (2016)
##  3 Jan   N               26 0.0339  Detroit, MI (2016)
##  4 Jan   NNE             26 0.0339  Detroit, MI (2016)
##  5 Jan   ENE              9 0.0117  Detroit, MI (2016)
##  6 Jan   E               16 0.0209  Detroit, MI (2016)
##  7 Jan   ESE             14 0.0183  Detroit, MI (2016)
##  8 Jan   SSE             72 0.0939  Detroit, MI (2016)
##  9 Jan   S              131 0.171   Detroit, MI (2016)
## 10 Jan   SSW            158 0.206   Detroit, MI (2016)
## # ... with 494 more rows

Combining Functions

The functions can be combined so that a full process can be run for a given file:

# File name to city name mapper
cityNameMapper <- c(kdtw_2016="Detroit, MI (2016)", 
                    kewr_2016="Newark, NJ (2016)",
                    kgrb_2016="Green Bay, WI (2016)",
                    kgrr_2016="Grand Rapids, MI (2016)",
                    kiah_2016="Houston, TX (2016)",
                    kind_2016="Indianapolis, IN (2016)",
                    klas_2015="Las Vegas, NV (2015)",
                    klas_2016="Las Vegas, NV (2016)", 
                    klas_2017="Las Vegas, NV (2017)", 
                    klnk_2016="Lincoln, NE (2016)",
                    kmke_2016="Milwaukee, WI (2016)",
                    kmsn_2016="Madison, WI (2016)",
                    kmsp_2016="Minneapolis, MN (2016)",
                    kmsy_2015="New Orleans, LA (2015)",
                    kmsy_2016="New Orleans, LA (2016)", 
                    kmsy_2017="New Orleans, LA (2017)", 
                    kord_2015="Chicago, IL (2015)",
                    kord_2016="Chicago, IL (2016)", 
                    kord_2017="Chicago, IL (2017)", 
                    ksan_2015="San Diego, CA (2015)",
                    ksan_2016="San Diego, CA (2016)",
                    ksan_2017="San Diego, CA (2017)",
                    ktvc_2016="Traverse City, MI (2016)"
                    )

# This is a helper function to create a locale description
getLocaleDescription <- function(x, mapper=cityNameMapper) {
    
    # Initialize the description as NULL
    desc <- NULL
    
    for (potMatch in names(mapper)) {
        if (str_detect(string=x, pattern=potMatch)) {
            desc <- mapper[potMatch]
            break
        }
    }
    
    # If the mapping failed, use UNMAPPED_x as the description
    if (is.null(desc)) {
        desc <- paste0("UNMAPPED_", x)
        cat("\nUnable to find a description, will use ", desc, "\n\n", sep="")
    } else {
        cat("\nWill use ", desc, " as the description for ", x, "\n\n", sep="")
    }
    
    # Return the descriptive name
    desc
    
}
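The matching logic in getLocaleDescription() can be exercised in isolation. Below is a minimal sketch using base R grepl() in place of stringr's str_detect(), with a hypothetical two-entry mapper (miniMapper and lookupDesc are illustrative names, not part of the module):

```r
# Hypothetical two-entry mapper for illustration only
miniMapper <- c(klas_2016 = "Las Vegas, NV (2016)",
                kmsy_2016 = "New Orleans, LA (2016)")

# Same first-match lookup as getLocaleDescription(), using base R grepl()
lookupDesc <- function(x, mapper = miniMapper) {
    for (potMatch in names(mapper)) {
        if (grepl(potMatch, x, fixed = TRUE)) {
            return(unname(mapper[potMatch]))
        }
    }
    # Fall back to UNMAPPED_x when no mapper entry matches
    paste0("UNMAPPED_", x)
}

lookupDesc("metar_klas_2016.rds")   # "Las Vegas, NV (2016)"
lookupDesc("metar_kxyz_2016.rds")   # "UNMAPPED_metar_kxyz_2016.rds"
```

Because the loop breaks at the first match, the ordering of the mapper entries matters if one station/year key is a substring of another.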

# The following function runs, in sequence, the EDA functions that operate on a single data source
combinedEDA <- function(filename=NULL, 
                        tbl=NULL,
                        desc=NULL,
                        mets=c("WindDir", "WindSpeed", "TempC", "DewC", "Altimeter", 
                               "modSLP", "cType1", "cLevel1", "month", "day"
                               ),
                        corPairs=list(c("TempC", "TempF"), 
                                      c("TempC", "DewC"), 
                                      c("Altimeter", "modSLP"), 
                                      c("Altimeter", "WindSpeed")
                                      ),
                        fctPairs=list(c("month", "TempF"), 
                                      c("month", "DewF"), 
                                      c("month", "WindSpeed"), 
                                      c("month", "Altimeter"), 
                                      c("wType", "Visibility"), 
                                      c("wType", "WindSpeed"), 
                                      c("WindDir", "WindSpeed"), 
                                      c("WindDir", "TempF")
                                      ),
                        heatVars=c("TempC", "TempF", 
                                   "DewC", "DewF", 
                                   "Altimeter", "modSLP", 
                                   "WindSpeed", "Visibility", 
                                   "monthint", "day"
                                   ),
                        lmVars=list(c("modSLP", "Altimeter"), 
                                    c("modSLP", "Altimeter + TempF"), 
                                    c("TempF", "DewF"), 
                                    c("WindSpeed", "Altimeter + TempF + DewF + month")
                                    ),
                        mapVariables=varMapper,
                        mapFileNames=cityNameMapper,
                        path="./RInputFiles/ProcessedMETAR/"
                        ) {
    
    # Require that either filename OR tbl be passed
    if (is.null(filename) && is.null(tbl)) {
        cat("\nMust provide either a filename or an already-loaded tibble\n")
        stop("\nfilename=NULL and tbl=NULL may not both be passed to combinedEDA()\n")
    }
    
    # Require that either 1) filename and mapFileNames, OR 2) desc be passed
    if ((is.null(filename) || is.null(mapFileNames)) && is.null(desc)) {
        cat("\nMust provide either a filename with mapFileNames or a file description\n")
        stop("\nWhen desc=NULL must have non-null entries for both filename= and mapFileNames=\n")
    }
    
    # Find the description if it is NULL (default)
    if (is.null(desc)) {
        desc <- getLocaleDescription(filename, mapper=mapFileNames)
    }
    
    # Warn if both filename and tbl are passed, since tbl will be used
    if (!is.null(filename) && !is.null(tbl)) {
        cat("\nA tibble has been passed and will be used as the dataset for this function\n")
        warning("\nArgument filename=", filename, " is NOT loaded since a tibble was passed\n")
    }
    
    # Read in the file unless tbl has already been passed to the routine
    if (is.null(tbl)) {
        tbl <- readRDS(paste0(path, filename))
    }
    
    # Plot counts by metric
    plotcountsByMetric(tbl, mets=mets, title=desc, diagnose=TRUE)
    
    # Plot relationships between two variables
    for (ys in corPairs) {
        plotNumCor(tbl, var1=ys[1], var2=ys[2], subT=desc, diagnose=TRUE)
    }
    
    # Plot numeric variables against factor variables
    for (ys in fctPairs) {
        plotFactorNumeric(tbl, fctVar=ys[1], numVar=ys[2], subT=desc, showXLabel=FALSE, diagnose=TRUE)
    }
    
    # Heatmap for variable correlations
    corMETAR(tbl, numVars=heatVars, subT=paste0(desc, " METAR"))

    # Run linear regressions
    for (ys in lmVars) {
        lmMETAR(tbl, y=ys[1], x=ys[2], yName=mapVariables[ys[1]], subT=desc)
    }
    
    # Run the basic wind plots
    basicWindPlots(tbl, desc=desc, gran="")
    
    # Return the tibble
    tbl
    
}

The process can then be run for each of Detroit (2016), Las Vegas (2016), and New Orleans (2016):

# Retrieve the Detroit, MI (2016) data
kdtw_2016 <- combinedEDA("metar_kdtw_2016.rds")
## 
## Will use Detroit, MI (2016) as the description for metar_kdtw_2016.rds
## 
## 
## Dropping 1 rows with 49 observations due to NA

## 
## Dropping 1 rows with 49 observations due to NA

## 
## Dropping 1 rows with 49 observations due to NA

## 
## Dropping 1 rows with 49 observations due to NA

## 
## Dropping 1 rows with 49 observations due to NA

## 
## Dropping 1 rows with 49 observations due to NA

## 
## Dropping 1 rows with 651 observations due to NA

## 
## Dropping 1 rows with 49 observations due to NA

## 
## Dropping 1 rows with 49 observations due to NA

## 
## Dropping 1 rows with 49 observations due to NA

## 
## Dropping 1 rows with 49 observations due to NA

## 
## Removing 49 records due to NA

## 
## Removing 49 records due to NA

## 
## Removing 49 records due to NA

## 
## Removing 49 records due to NA

## 
## Removing 49 records due to NA

## 
## Removing 49 records due to NA

## 
## Removing 49 records due to NA

## 
## Removing 49 records due to NA

## 
##  *** Correlations use 8769 complete cases (99.4% of 8818 total) ***
##            TempC TempF  DewC  DewF Altimeter modSLP WindSpeed Visibility
## TempC       1.00  1.00  0.92  0.92     -0.17  -0.24     -0.10       0.14
## TempF       1.00  1.00  0.92  0.92     -0.17  -0.24     -0.10       0.14
## DewC        0.92  0.92  1.00  1.00     -0.22  -0.28     -0.18      -0.07
## DewF        0.92  0.92  1.00  1.00     -0.22  -0.28     -0.18      -0.07
## Altimeter  -0.17 -0.17 -0.22 -0.22      1.00   1.00     -0.37       0.15
## modSLP     -0.24 -0.24 -0.28 -0.28      1.00   1.00     -0.35       0.14
## WindSpeed  -0.10 -0.10 -0.18 -0.18     -0.37  -0.35      1.00       0.10
## Visibility  0.14  0.14 -0.07 -0.07      0.15   0.14      0.10       1.00
## monthint    0.24  0.24  0.32  0.32      0.19   0.17     -0.06      -0.11
## day         0.04  0.04  0.03  0.03     -0.02  -0.02      0.06       0.04
##            monthint   day
## TempC          0.24  0.04
## TempF          0.24  0.04
## DewC           0.32  0.03
## DewF           0.32  0.03
## Altimeter      0.19 -0.02
## modSLP         0.17 -0.02
## WindSpeed     -0.06  0.06
## Visibility    -0.11  0.04
## monthint       1.00  0.02
## day            0.02  1.00

## 
##  *** Regression call is: modSLP ~ Altimeter ***
## 
## Call:
## lm(formula = formula(myChar), data = met)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.96361 -0.44022 -0.03758  0.40772  1.41448 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -16.44389    0.74723  -22.01   <2e-16 ***
## Altimeter    34.41514    0.02488 1383.17   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4961 on 8767 degrees of freedom
##   (49 observations deleted due to missingness)
## Multiple R-squared:  0.9954, Adjusted R-squared:  0.9954 
## F-statistic: 1.913e+06 on 1 and 8767 DF,  p-value: < 2.2e-16

## 
##  *** Regression call is: modSLP ~ Altimeter + TempF ***
## 
## Call:
## lm(formula = formula(myChar), data = met)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.57802 -0.12674  0.00481  0.12820  0.65708 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.713e+00  2.803e-01  -13.25   <2e-16 ***
## Altimeter    3.403e+01  9.302e-03 3658.80   <2e-16 ***
## TempF       -2.351e-02  9.941e-05 -236.48   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1826 on 8766 degrees of freedom
##   (49 observations deleted due to missingness)
## Multiple R-squared:  0.9994, Adjusted R-squared:  0.9994 
## F-statistic: 7.086e+06 on 2 and 8766 DF,  p-value: < 2.2e-16

## 
##  *** Regression call is: TempF ~ DewF ***
## 
## Call:
## lm(formula = formula(myChar), data = met)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.132  -6.065  -2.083   4.042  31.029 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.060663   0.211976   52.18   <2e-16 ***
## DewF         1.000977   0.004677  214.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.987 on 8767 degrees of freedom
##   (49 observations deleted due to missingness)
## Multiple R-squared:  0.8393, Adjusted R-squared:  0.8393 
## F-statistic: 4.58e+04 on 1 and 8767 DF,  p-value: < 2.2e-16

## 
##  *** Regression call is: WindSpeed ~ Altimeter + TempF + DewF + month ***
## 
## Call:
## lm(formula = formula(myChar), data = met)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.3544  -2.5454  -0.1494   2.3669  20.6042 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 306.827720   6.726512  45.615  < 2e-16 ***
## Altimeter    -9.980907   0.222840 -44.790  < 2e-16 ***
## TempF         0.206384   0.006044  34.145  < 2e-16 ***
## DewF         -0.217533   0.006755 -32.204  < 2e-16 ***
## monthFeb      0.188758   0.204057   0.925   0.3550    
## monthMar     -1.375926   0.211244  -6.513 7.75e-11 ***
## monthApr     -2.406187   0.217870 -11.044  < 2e-16 ***
## monthMay     -3.616016   0.248432 -14.555  < 2e-16 ***
## monthJun     -4.337974   0.277532 -15.631  < 2e-16 ***
## monthJul     -3.428235   0.300872 -11.394  < 2e-16 ***
## monthAug     -2.971582   0.312087  -9.522  < 2e-16 ***
## monthSep     -1.839206   0.292614  -6.285 3.43e-10 ***
## monthOct     -0.477868   0.255640  -1.869   0.0616 .  
## monthNov     -0.013008   0.226830  -0.057   0.9543    
## monthDec      2.071836   0.200848  10.315  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.883 on 8754 degrees of freedom
##   (49 observations deleted due to missingness)
## Multiple R-squared:  0.3218, Adjusted R-squared:  0.3207 
## F-statistic: 296.7 on 14 and 8754 DF,  p-value: < 2.2e-16

# Retrieve the Las Vegas, NV (2016) data
klas_2016 <- combinedEDA("metar_klas_2016.rds")
## 
## Will use Las Vegas, NV (2016) as the description for metar_klas_2016.rds
## 
## 
## Dropping 1 rows with 35 observations due to NA

## 
## Dropping 1 rows with 35 observations due to NA

## 
## Dropping 1 rows with 35 observations due to NA

## 
## Dropping 1 rows with 35 observations due to NA

## 
## Dropping 1 rows with 35 observations due to NA

## 
## Dropping 1 rows with 35 observations due to NA

## 
## Dropping 1 rows with 2440 observations due to NA

## 
## Dropping 1 rows with 35 observations due to NA

## 
## Dropping 1 rows with 35 observations due to NA

## 
## Dropping 1 rows with 35 observations due to NA

## 
## Dropping 1 rows with 35 observations due to NA

## 
## Removing 35 records due to NA

## 
## Removing 35 records due to NA

## 
## Removing 35 records due to NA

## 
## Removing 35 records due to NA

## 
## Removing 35 records due to NA

## 
## Removing 35 records due to NA

## 
## Removing 35 records due to NA

## 
## Removing 35 records due to NA

## 
##  *** Correlations use 8783 complete cases (99.6% of 8818 total) ***
##            TempC TempF  DewC  DewF Altimeter modSLP WindSpeed Visibility
## TempC       1.00  1.00  0.22  0.22     -0.51  -0.62      0.22       0.01
## TempF       1.00  1.00  0.22  0.22     -0.51  -0.62      0.22       0.01
## DewC        0.22  0.22  1.00  1.00     -0.24  -0.27     -0.04      -0.13
## DewF        0.22  0.22  1.00  1.00     -0.24  -0.27     -0.04      -0.13
## Altimeter  -0.51 -0.51 -0.24 -0.24      1.00   0.99     -0.38       0.06
## modSLP     -0.62 -0.62 -0.27 -0.27      0.99   1.00     -0.38       0.06
## WindSpeed   0.22  0.22 -0.04 -0.04     -0.38  -0.38      1.00      -0.02
## Visibility  0.01  0.01 -0.13 -0.13      0.06   0.06     -0.02       1.00
## monthint    0.14  0.14  0.06  0.06      0.06   0.03     -0.01      -0.02
## day        -0.01 -0.01  0.03  0.03     -0.01  -0.01      0.00      -0.07
##            monthint   day
## TempC          0.14 -0.01
## TempF          0.14 -0.01
## DewC           0.06  0.03
## DewF           0.06  0.03
## Altimeter      0.06 -0.01
## modSLP         0.03 -0.01
## WindSpeed     -0.01  0.00
## Visibility    -0.02 -0.07
## monthint       1.00  0.02
## day            0.02  1.00

## 
##  *** Regression call is: modSLP ~ Altimeter ***
## 
## Call:
## lm(formula = formula(myChar), data = met)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1381 -0.7848 -0.0607  0.6891  3.6362 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -116.30984    1.72975  -67.24   <2e-16 ***
## Altimeter     37.68826    0.05776  652.51   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9753 on 8781 degrees of freedom
##   (35 observations deleted due to missingness)
## Multiple R-squared:  0.9798, Adjusted R-squared:  0.9798 
## F-statistic: 4.258e+05 on 1 and 8781 DF,  p-value: < 2.2e-16

## 
##  *** Regression call is: modSLP ~ Altimeter + TempF ***
## 
## Call:
## lm(formula = formula(myChar), data = met)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.18161 -0.33113  0.02395  0.33395  1.08509 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.537e+01  8.693e-01  -29.18   <2e-16 ***
## Altimeter    3.478e+01  2.868e-02 1212.89   <2e-16 ***
## TempF       -5.581e-02  2.813e-04 -198.42   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4165 on 8780 degrees of freedom
##   (35 observations deleted due to missingness)
## Multiple R-squared:  0.9963, Adjusted R-squared:  0.9963 
## F-statistic: 1.187e+06 on 2 and 8780 DF,  p-value: < 2.2e-16

## 
##  *** Regression call is: TempF ~ DewF ***
## 
## Call:
## lm(formula = formula(myChar), data = met)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.026 -14.381  -0.341  13.299  46.312 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 62.12523    0.49251  126.14   <2e-16 ***
## DewF         0.32810    0.01576   20.82   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.94 on 8781 degrees of freedom
##   (35 observations deleted due to missingness)
## Multiple R-squared:  0.04705,    Adjusted R-squared:  0.04694 
## F-statistic: 433.5 on 1 and 8781 DF,  p-value: < 2.2e-16

## 
##  *** Regression call is: WindSpeed ~ Altimeter + TempF + DewF + month ***
## 
## Call:
## lm(formula = formula(myChar), data = met)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.1199  -2.7665  -0.6134   2.0975  22.2630 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 316.166903   9.583584  32.990  < 2e-16 ***
## Altimeter   -10.363527   0.316290 -32.766  < 2e-16 ***
## TempF         0.012196   0.005239   2.328 0.019933 *  
## DewF         -0.053040   0.004107 -12.915  < 2e-16 ***
## monthFeb      1.477528   0.225729   6.546 6.26e-11 ***
## monthMar      0.848312   0.232319   3.652 0.000262 ***
## monthApr      1.590035   0.242532   6.556 5.84e-11 ***
## monthMay      0.327608   0.261841   1.251 0.210905    
## monthJun      0.141204   0.319290   0.442 0.658325    
## monthJul      1.092983   0.328362   3.329 0.000876 ***
## monthAug      0.545515   0.319165   1.709 0.087450 .  
## monthSep      0.647423   0.281377   2.301 0.021420 *  
## monthOct      1.147640   0.253286   4.531 5.95e-06 ***
## monthNov      1.097005   0.227198   4.828 1.40e-06 ***
## monthDec      0.672432   0.214810   3.130 0.001752 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.181 on 8768 degrees of freedom
##   (35 observations deleted due to missingness)
## Multiple R-squared:  0.1774, Adjusted R-squared:  0.176 
## F-statistic:   135 on 14 and 8768 DF,  p-value: < 2.2e-16

# Retrieve the New Orleans, LA (2016) data
kmsy_2016 <- combinedEDA("metar_kmsy_2016.rds")
## 
## Will use New Orleans, LA (2016) as the description for metar_kmsy_2016.rds
## 
## 
## Dropping 1 rows with 14 observations due to NA

## 
## Dropping 1 rows with 14 observations due to NA

## 
## Dropping 1 rows with 14 observations due to NA

## 
## Dropping 1 rows with 14 observations due to NA

## 
## Dropping 1 rows with 14 observations due to NA

## 
## Dropping 1 rows with 14 observations due to NA

## 
## Dropping 1 rows with 1029 observations due to NA

## 
## Dropping 1 rows with 14 observations due to NA

## 
## Dropping 1 rows with 14 observations due to NA

## 
## Dropping 1 rows with 14 observations due to NA

## 
## Dropping 1 rows with 14 observations due to NA

## 
## Removing 14 records due to NA

## 
## Removing 14 records due to NA

## 
## Removing 14 records due to NA

## 
## Removing 14 records due to NA

## 
## Removing 14 records due to NA

## 
## Removing 14 records due to NA

## 
## Removing 14 records due to NA

## 
## Removing 14 records due to NA

## 
##  *** Correlations use 8799 complete cases (99.8% of 8813 total) ***
##            TempC TempF  DewC  DewF Altimeter modSLP WindSpeed Visibility
## TempC       1.00  1.00  0.85  0.85     -0.48  -0.47     -0.09       0.13
## TempF       1.00  1.00  0.85  0.85     -0.48  -0.48     -0.09       0.13
## DewC        0.85  0.85  1.00  1.00     -0.56  -0.56     -0.21      -0.06
## DewF        0.85  0.85  1.00  1.00     -0.57  -0.56     -0.21      -0.05
## Altimeter  -0.48 -0.48 -0.56 -0.57      1.00   1.00     -0.08       0.08
## modSLP     -0.47 -0.48 -0.56 -0.56      1.00   1.00     -0.08       0.08
## WindSpeed  -0.09 -0.09 -0.21 -0.21     -0.08  -0.08      1.00       0.03
## Visibility  0.13  0.13 -0.06 -0.05      0.08   0.08      0.03       1.00
## monthint    0.26  0.26  0.31  0.31      0.09   0.09     -0.22      -0.02
## day         0.03  0.03  0.07  0.07      0.02   0.02     -0.05       0.01
##            monthint   day
## TempC          0.26  0.03
## TempF          0.26  0.03
## DewC           0.31  0.07
## DewF           0.31  0.07
## Altimeter      0.09  0.02
## modSLP         0.09  0.02
## WindSpeed     -0.22 -0.05
## Visibility    -0.02  0.01
## monthint       1.00  0.02
## day            0.02  1.00

## 
##  *** Regression call is: modSLP ~ Altimeter ***
## 
## Call:
## lm(formula = formula(myChar), data = met)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.232515 -0.083700  0.000544  0.084787  0.227320 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.773456   0.230067   12.05   <2e-16 ***
## Altimeter   33.779607   0.007654 4413.27   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1015 on 8797 degrees of freedom
##   (14 observations deleted due to missingness)
## Multiple R-squared:  0.9995, Adjusted R-squared:  0.9995 
## F-statistic: 1.948e+07 on 1 and 8797 DF,  p-value: < 2.2e-16

## 
##  *** Regression call is: modSLP ~ Altimeter + TempF ***
## 
## Call:
## lm(formula = formula(myChar), data = met)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.222066 -0.082359 -0.001036  0.083533  0.217460 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) 1.557e+00  2.638e-01    5.903  3.7e-09 ***
## Altimeter   3.382e+01  8.667e-03 3902.029  < 2e-16 ***
## TempF       8.751e-04  9.432e-05    9.277  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.101 on 8796 degrees of freedom
##   (14 observations deleted due to missingness)
## Multiple R-squared:  0.9996, Adjusted R-squared:  0.9996 
## F-statistic: 9.833e+06 on 2 and 8796 DF,  p-value: < 2.2e-16

## 
##  *** Regression call is: TempF ~ DewF ***
## 
## Call:
## lm(formula = formula(myChar), data = met)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.051  -4.777  -1.265   4.199  26.477 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 24.440028   0.319858   76.41   <2e-16 ***
## DewF         0.778877   0.005065  153.77   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.763 on 8797 degrees of freedom
##   (14 observations deleted due to missingness)
## Multiple R-squared:  0.7288, Adjusted R-squared:  0.7288 
## F-statistic: 2.364e+04 on 1 and 8797 DF,  p-value: < 2.2e-16

## 
##  *** Regression call is: WindSpeed ~ Altimeter + TempF + DewF + month ***
## 
## Call:
## lm(formula = formula(myChar), data = met)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.9452  -2.5710  -0.2247   2.2592  18.9024 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 190.986029  11.812622  16.168  < 2e-16 ***
## Altimeter    -6.223347   0.387357 -16.066  < 2e-16 ***
## TempF         0.225378   0.007290  30.915  < 2e-16 ***
## DewF         -0.169883   0.006513 -26.082  < 2e-16 ***
## monthFeb     -0.787460   0.204817  -3.845 0.000122 ***
## monthMar     -0.270554   0.215758  -1.254 0.209885    
## monthApr     -1.941174   0.225642  -8.603  < 2e-16 ***
## monthMay     -3.763407   0.241926 -15.556  < 2e-16 ***
## monthJun     -5.215285   0.269852 -19.326  < 2e-16 ***
## monthJul     -5.568010   0.283515 -19.639  < 2e-16 ***
## monthAug     -4.834769   0.277122 -17.446  < 2e-16 ***
## monthSep     -5.769099   0.272789 -21.149  < 2e-16 ***
## monthOct     -4.812814   0.245841 -19.577  < 2e-16 ***
## monthNov     -2.827680   0.220138 -12.845  < 2e-16 ***
## monthDec     -0.203742   0.210552  -0.968 0.333241    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.847 on 8784 degrees of freedom
##   (14 observations deleted due to missingness)
## Multiple R-squared:  0.2278, Adjusted R-squared:  0.2266 
## F-statistic: 185.1 on 14 and 8784 DF,  p-value: < 2.2e-16

Directing Output to a Separate File

There is a lot of good information in the EDA, but having everything in one place can be overwhelming. A wrapper function can be built to redirect the text output and graphics to designated files:

wrapCombinedEDA <- function(readFile, 
                            readPath="./RInputFiles/ProcessedMETAR/", 
                            mapFileNames=cityNameMapper,
                            desc=NULL,
                            writeLogFile=NULL,
                            writeLogPDF=NULL,
                            writeLogPath=NULL,
                            appendWriteFile=FALSE,
                            ...
                            ) {
    
    # Read in the requested file
    tbl <- readRDS(paste0(readPath, readFile))

    # Find the description if it has not been passed
    if (is.null(desc)) {
        desc <- getLocaleDescription(readFile, mapper=mapFileNames)
    }
    
    # Helper function that only runs the combinedEDA() routine
    coreFunc <- function() { combinedEDA(tbl=tbl, desc=desc, mapFileNames=mapFileNames, ...) }
    
    # If writeLogPDF is not NULL, direct the graphs to a suitable PDF
    if (!is.null(writeLogPDF)) {
        
        # Prepend the log path if one has been provided
        if (!is.null(writeLogPath)) {
            writeLogPDF <- paste0(writeLogPath, writeLogPDF)
        }
        
        # Provide the location of the EDA pdf file
        cat("\nEDA PDF file is available at:", writeLogPDF, "\n")

        # Redirect the writing to writeLogPDF
        pdf(writeLogPDF)
    }
    
    # Run EDA on the tbl using capture.output to redirect to a log file if specified
    if (!is.null(writeLogFile)) {
        
        # Prepend the log path if one has been provided
        if (!is.null(writeLogPath)) {
            writeLogFile <- paste0(writeLogPath, writeLogFile)
        }
        
        # Provide the location of the EDA log file
        cat("\nEDA log file is available at:", writeLogFile, "\n")
        
        # Run EDA such that the output goes to the log file
        capture.output(coreFunc(), 
                       file=writeLogFile, 
                       append=appendWriteFile
                       )
        
    } else {
        # Run EDA such that output stays in stdout
        coreFunc()
    }
    
    # If writeLogPDF is not NULL, close the PDF device so graphics return to the default device
    if (!is.null(writeLogPDF)) {
        dev.off()
    }
    
    # Return the tbl
    tbl
    
}
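The two redirection mechanisms used in wrapCombinedEDA() can be seen in isolation. This is a minimal sketch writing to temporary files (the plot and dataset are placeholders, not part of the EDA):

```r
# pdf() opens a PDF graphics device; subsequent plots land in the file
tmpPDF <- tempfile(fileext = ".pdf")
pdf(tmpPDF)
plot(1:10, main = "Example plot captured to PDF")
dev.off()   # close the device, returning graphics to the default device

# capture.output() sends printed output to a log file instead of stdout
tmpLog <- tempfile(fileext = ".log")
capture.output(summary(cars), file = tmpLog, append = FALSE)
readLines(tmpLog)[1]   # first line of the captured summary
```

Note that pdf()/dev.off() only affect graphics while capture.output() only affects printed text, which is why wrapCombinedEDA() handles the two streams with separate arguments.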
filePath <- "./RInputFiles/ProcessedMETAR/"

# Example for the basic function for Chicago, IL (2016) written to stdout
kord_2016 <- wrapCombinedEDA("metar_kord_2016.rds", readPath=filePath)
## 
## Will use Chicago, IL (2016) as the description for metar_kord_2016.rds
## 
## 
## Dropping 1 rows with 10 observations due to NA

## 
## Dropping 1 rows with 10 observations due to NA

## 
## Dropping 1 rows with 10 observations due to NA

## 
## Dropping 1 rows with 10 observations due to NA

## 
## Dropping 1 rows with 10 observations due to NA

## 
## Dropping 1 rows with 10 observations due to NA

## 
## Dropping 1 rows with 823 observations due to NA

## 
## Dropping 1 rows with 10 observations due to NA

## 
## Dropping 1 rows with 10 observations due to NA

## 
## Dropping 1 rows with 10 observations due to NA

## 
## Dropping 1 rows with 10 observations due to NA

## 
## Removing 10 records due to NA

## 
## Removing 10 records due to NA

## 
## Removing 10 records due to NA

## 
## Removing 10 records due to NA

## 
## Removing 10 records due to NA

## 
## Removing 10 records due to NA

## 
## Removing 10 records due to NA

## 
## Removing 10 records due to NA

## 
##  *** Correlations use 8805 complete cases (99.9% of 8815 total) ***
##            TempC TempF  DewC  DewF Altimeter modSLP WindSpeed Visibility
## TempC       1.00  1.00  0.93  0.93     -0.22  -0.29     -0.12       0.19
## TempF       1.00  1.00  0.93  0.93     -0.22  -0.29     -0.12       0.19
## DewC        0.93  0.93  1.00  1.00     -0.27  -0.34     -0.19       0.05
## DewF        0.93  0.93  1.00  1.00     -0.28  -0.34     -0.19       0.05
## Altimeter  -0.22 -0.22 -0.27 -0.28      1.00   1.00     -0.31       0.19
## modSLP     -0.29 -0.29 -0.34 -0.34      1.00   1.00     -0.29       0.17
## WindSpeed  -0.12 -0.12 -0.19 -0.19     -0.31  -0.29      1.00       0.01
## Visibility  0.19  0.19  0.05  0.05      0.19   0.17      0.01       1.00
## monthint    0.24  0.24  0.27  0.27      0.18   0.16     -0.09       0.02
## day         0.05  0.05  0.05  0.05     -0.06  -0.06      0.08       0.04
##            monthint   day
## TempC          0.24  0.05
## TempF          0.24  0.05
## DewC           0.27  0.05
## DewF           0.27  0.05
## Altimeter      0.18 -0.06
## modSLP         0.16 -0.06
## WindSpeed     -0.09  0.08
## Visibility     0.02  0.04
## monthint       1.00  0.02
## day            0.02  1.00

## 
##  *** Regression call is: modSLP ~ Altimeter ***
## 
## Call:
## lm(formula = formula(myChar), data = met)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.31041 -0.47561 -0.07852  0.43888  1.66927 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -23.11596    0.85978  -26.89   <2e-16 ***
## Altimeter    34.63779    0.02864 1209.37   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5573 on 8803 degrees of freedom
##   (10 observations deleted due to missingness)
## Multiple R-squared:  0.994,  Adjusted R-squared:  0.994 
## F-statistic: 1.463e+06 on 1 and 8803 DF,  p-value: < 2.2e-16

## 
##  *** Regression call is: modSLP ~ Altimeter + TempF ***
## 
## Call:
## lm(formula = formula(myChar), data = met)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.69460 -0.12269  0.00055  0.12142  0.73628 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.603e+00  2.864e-01  -16.07   <2e-16 ***
## Altimeter    3.407e+01  9.502e-03 3585.30   <2e-16 ***
## TempF       -2.606e-02  9.503e-05 -274.26   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1804 on 8802 degrees of freedom
##   (10 observations deleted due to missingness)
## Multiple R-squared:  0.9994, Adjusted R-squared:  0.9994 
## F-statistic: 7.018e+06 on 2 and 8802 DF,  p-value: < 2.2e-16

## 
##  *** Regression call is: TempF ~ DewF ***
## 
## Call:
## lm(formula = formula(myChar), data = met)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.638  -5.485  -1.570   3.601  35.456 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.317877   0.188557   54.72   <2e-16 ***
## DewF         1.004504   0.004095  245.33   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.407 on 8803 degrees of freedom
##   (10 observations deleted due to missingness)
## Multiple R-squared:  0.8724, Adjusted R-squared:  0.8724 
## F-statistic: 6.019e+04 on 1 and 8803 DF,  p-value: < 2.2e-16

## 
##  *** Regression call is: WindSpeed ~ Altimeter + TempF + DewF + month ***
## 
## Call:
## lm(formula = formula(myChar), data = met)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.295  -2.635  -0.157   2.494  21.356 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 291.341063   7.156664  40.709  < 2e-16 ***
## Altimeter    -9.415029   0.237118 -39.706  < 2e-16 ***
## TempF         0.160498   0.006445  24.904  < 2e-16 ***
## DewF         -0.205407   0.006912 -29.718  < 2e-16 ***
## monthFeb      0.649436   0.210063   3.092 0.001997 ** 
## monthMar      0.033603   0.220448   0.152 0.878850    
## monthApr      0.092196   0.230608   0.400 0.689317    
## monthMay     -0.859361   0.258300  -3.327 0.000882 ***
## monthJun     -1.564544   0.296635  -5.274 1.36e-07 ***
## monthJul     -1.191767   0.314437  -3.790 0.000152 ***
## monthAug     -1.319054   0.319647  -4.127 3.72e-05 ***
## monthSep      0.394263   0.299972   1.314 0.188768    
## monthOct      0.511074   0.264003   1.936 0.052916 .  
## monthNov      0.050679   0.235873   0.215 0.829882    
## monthDec      1.741405   0.205243   8.485  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.984 on 8790 degrees of freedom
##   (10 observations deleted due to missingness)
## Multiple R-squared:  0.2494, Adjusted R-squared:  0.2482 
## F-statistic: 208.6 on 14 and 8790 DF,  p-value: < 2.2e-16

# Example for the basic function for San Diego, CA (2016) written to 'metar_ksan_2016_EDA.log'
ksan_2016 <- wrapCombinedEDA("metar_ksan_2016.rds", 
                             readPath=filePath, 
                             writeLogFile='metar_ksan_2016_EDA.log',
                             writeLogPDF='metar_ksan_2016_EDA.pdf', 
                             writeLogPath=filePath
                             )
## 
## Will use San Diego, CA (2016) as the description for metar_ksan_2016.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_ksan_2016_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_ksan_2016_EDA.log

Finally, a wrapper function can be defined to construct the inputs to wrapCombinedEDA(), reducing repetitive typing:

logAndPDFCombinedEDA <- function(tblName, filePath="./RInputFiles/ProcessedMETAR/") {
    
    # Create the RDS file name
    rdsName <- paste0("metar_", tblName, ".rds")
    cat("\nRDS Name:", rdsName)
    
    # Create the log file name
    logName <- paste0("metar_", tblName, "_EDA.log")
    cat("\nLog Name:", logName)
    
    # Create the PDF file name
    pdfName <- paste0("metar_", tblName, "_EDA.pdf")
    cat("\nPDF Name:", pdfName)
    
    # Call wrapCombinedEDA()
    tbl <- wrapCombinedEDA(rdsName, 
                           readPath=filePath, 
                           writeLogFile=logName, 
                           writeLogPDF=pdfName,
                           writeLogPath=filePath
                           )
    
    # Return the tbl
    tbl
    
}

This function can then be run for all of the relevant files (cached to avoid multiple runs):

# Run for 2016 only for kdtw, kewr, kgrb, kgrr, kiah, kind, klnk, kmke, kmsn, kmsp, ktvc
kdtw_2016 <- logAndPDFCombinedEDA("kdtw_2016")
## 
## RDS Name: metar_kdtw_2016.rds
## Log Name: metar_kdtw_2016_EDA.log
## PDF Name: metar_kdtw_2016_EDA.pdf
## Will use Detroit, MI (2016) as the description for metar_kdtw_2016.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kdtw_2016_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kdtw_2016_EDA.log
kewr_2016 <- logAndPDFCombinedEDA("kewr_2016")
## 
## RDS Name: metar_kewr_2016.rds
## Log Name: metar_kewr_2016_EDA.log
## PDF Name: metar_kewr_2016_EDA.pdf
## Will use Newark, NJ (2016) as the description for metar_kewr_2016.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kewr_2016_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kewr_2016_EDA.log
kgrb_2016 <- logAndPDFCombinedEDA("kgrb_2016")
## 
## RDS Name: metar_kgrb_2016.rds
## Log Name: metar_kgrb_2016_EDA.log
## PDF Name: metar_kgrb_2016_EDA.pdf
## Will use Green Bay, WI (2016) as the description for metar_kgrb_2016.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kgrb_2016_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kgrb_2016_EDA.log
kgrr_2016 <- logAndPDFCombinedEDA("kgrr_2016")
## 
## RDS Name: metar_kgrr_2016.rds
## Log Name: metar_kgrr_2016_EDA.log
## PDF Name: metar_kgrr_2016_EDA.pdf
## Will use Grand Rapids, MI (2016) as the description for metar_kgrr_2016.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kgrr_2016_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kgrr_2016_EDA.log
kiah_2016 <- logAndPDFCombinedEDA("kiah_2016")
## 
## RDS Name: metar_kiah_2016.rds
## Log Name: metar_kiah_2016_EDA.log
## PDF Name: metar_kiah_2016_EDA.pdf
## Will use Houston, TX (2016) as the description for metar_kiah_2016.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kiah_2016_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kiah_2016_EDA.log
kind_2016 <- logAndPDFCombinedEDA("kind_2016")
## 
## RDS Name: metar_kind_2016.rds
## Log Name: metar_kind_2016_EDA.log
## PDF Name: metar_kind_2016_EDA.pdf
## Will use Indianapolis, IN (2016) as the description for metar_kind_2016.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kind_2016_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kind_2016_EDA.log
klnk_2016 <- logAndPDFCombinedEDA("klnk_2016")
## 
## RDS Name: metar_klnk_2016.rds
## Log Name: metar_klnk_2016_EDA.log
## PDF Name: metar_klnk_2016_EDA.pdf
## Will use Lincoln, NE (2016) as the description for metar_klnk_2016.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_klnk_2016_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_klnk_2016_EDA.log
kmke_2016 <- logAndPDFCombinedEDA("kmke_2016")
## 
## RDS Name: metar_kmke_2016.rds
## Log Name: metar_kmke_2016_EDA.log
## PDF Name: metar_kmke_2016_EDA.pdf
## Will use Milwaukee, WI (2016) as the description for metar_kmke_2016.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kmke_2016_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kmke_2016_EDA.log
kmsn_2016 <- logAndPDFCombinedEDA("kmsn_2016")
## 
## RDS Name: metar_kmsn_2016.rds
## Log Name: metar_kmsn_2016_EDA.log
## PDF Name: metar_kmsn_2016_EDA.pdf
## Will use Madison, WI (2016) as the description for metar_kmsn_2016.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kmsn_2016_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kmsn_2016_EDA.log
kmsp_2016 <- logAndPDFCombinedEDA("kmsp_2016")
## 
## RDS Name: metar_kmsp_2016.rds
## Log Name: metar_kmsp_2016_EDA.log
## PDF Name: metar_kmsp_2016_EDA.pdf
## Will use Minneapolis, MN (2016) as the description for metar_kmsp_2016.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kmsp_2016_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kmsp_2016_EDA.log
ktvc_2016 <- logAndPDFCombinedEDA("ktvc_2016")
## 
## RDS Name: metar_ktvc_2016.rds
## Log Name: metar_ktvc_2016_EDA.log
## PDF Name: metar_ktvc_2016_EDA.pdf
## Will use Traverse City, MI (2016) as the description for metar_ktvc_2016.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_ktvc_2016_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_ktvc_2016_EDA.log
# Run for 2015-2016-2017 for klas, kmsy, kord, ksan
klas_2015 <- logAndPDFCombinedEDA("klas_2015")
## 
## RDS Name: metar_klas_2015.rds
## Log Name: metar_klas_2015_EDA.log
## PDF Name: metar_klas_2015_EDA.pdf
## Will use Las Vegas, NV (2015) as the description for metar_klas_2015.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_klas_2015_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_klas_2015_EDA.log
klas_2016 <- logAndPDFCombinedEDA("klas_2016")
## 
## RDS Name: metar_klas_2016.rds
## Log Name: metar_klas_2016_EDA.log
## PDF Name: metar_klas_2016_EDA.pdf
## Will use Las Vegas, NV (2016) as the description for metar_klas_2016.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_klas_2016_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_klas_2016_EDA.log
klas_2017 <- logAndPDFCombinedEDA("klas_2017")
## 
## RDS Name: metar_klas_2017.rds
## Log Name: metar_klas_2017_EDA.log
## PDF Name: metar_klas_2017_EDA.pdf
## Will use Las Vegas, NV (2017) as the description for metar_klas_2017.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_klas_2017_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_klas_2017_EDA.log
kmsy_2015 <- logAndPDFCombinedEDA("kmsy_2015")
## 
## RDS Name: metar_kmsy_2015.rds
## Log Name: metar_kmsy_2015_EDA.log
## PDF Name: metar_kmsy_2015_EDA.pdf
## Will use New Orleans, LA (2015) as the description for metar_kmsy_2015.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kmsy_2015_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kmsy_2015_EDA.log
kmsy_2016 <- logAndPDFCombinedEDA("kmsy_2016")
## 
## RDS Name: metar_kmsy_2016.rds
## Log Name: metar_kmsy_2016_EDA.log
## PDF Name: metar_kmsy_2016_EDA.pdf
## Will use New Orleans, LA (2016) as the description for metar_kmsy_2016.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kmsy_2016_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kmsy_2016_EDA.log
kmsy_2017 <- logAndPDFCombinedEDA("kmsy_2017")
## 
## RDS Name: metar_kmsy_2017.rds
## Log Name: metar_kmsy_2017_EDA.log
## PDF Name: metar_kmsy_2017_EDA.pdf
## Will use New Orleans, LA (2017) as the description for metar_kmsy_2017.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kmsy_2017_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kmsy_2017_EDA.log
kord_2015 <- logAndPDFCombinedEDA("kord_2015")
## 
## RDS Name: metar_kord_2015.rds
## Log Name: metar_kord_2015_EDA.log
## PDF Name: metar_kord_2015_EDA.pdf
## Will use Chicago, IL (2015) as the description for metar_kord_2015.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kord_2015_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kord_2015_EDA.log
kord_2016 <- logAndPDFCombinedEDA("kord_2016")
## 
## RDS Name: metar_kord_2016.rds
## Log Name: metar_kord_2016_EDA.log
## PDF Name: metar_kord_2016_EDA.pdf
## Will use Chicago, IL (2016) as the description for metar_kord_2016.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kord_2016_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kord_2016_EDA.log
kord_2017 <- logAndPDFCombinedEDA("kord_2017")
## 
## RDS Name: metar_kord_2017.rds
## Log Name: metar_kord_2017_EDA.log
## PDF Name: metar_kord_2017_EDA.pdf
## Will use Chicago, IL (2017) as the description for metar_kord_2017.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kord_2017_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kord_2017_EDA.log
ksan_2015 <- logAndPDFCombinedEDA("ksan_2015")
## 
## RDS Name: metar_ksan_2015.rds
## Log Name: metar_ksan_2015_EDA.log
## PDF Name: metar_ksan_2015_EDA.pdf
## Will use San Diego, CA (2015) as the description for metar_ksan_2015.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_ksan_2015_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_ksan_2015_EDA.log
ksan_2016 <- logAndPDFCombinedEDA("ksan_2016")
## 
## RDS Name: metar_ksan_2016.rds
## Log Name: metar_ksan_2016_EDA.log
## PDF Name: metar_ksan_2016_EDA.pdf
## Will use San Diego, CA (2016) as the description for metar_ksan_2016.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_ksan_2016_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_ksan_2016_EDA.log
ksan_2017 <- logAndPDFCombinedEDA("ksan_2017")
## 
## RDS Name: metar_ksan_2017.rds
## Log Name: metar_ksan_2017_EDA.log
## PDF Name: metar_ksan_2017_EDA.pdf
## Will use San Diego, CA (2017) as the description for metar_ksan_2017.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_ksan_2017_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_ksan_2017_EDA.log

Reload Traverse City, MI (2016) due to a previous error (short-term fix to avoid re-running a full cache):

ktvc_2016 <- logAndPDFCombinedEDA("ktvc_2016")
## 
## RDS Name: metar_ktvc_2016.rds
## Log Name: metar_ktvc_2016_EDA.log
## PDF Name: metar_ktvc_2016_EDA.pdf
## Will use Traverse City, MI (2016) as the description for metar_ktvc_2016.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_ktvc_2016_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_ktvc_2016_EDA.log

The files are now available for further analysis, with individual PDF files for plots associated with each locale.

Comparing Multiple Locales

With EDA about each locale saved to .pdf and .log files, it is interesting to investigate comparisons among the various locales. For example, how do the temperature patterns by month for a given locale compare to the overall global median temperature patterns by month?

The existing functions contain most of the code needed to perform this. The main steps to add are:

  1. Combine processed data files
  2. Adapt the base functions to create the overall mean or median, when requested by parameter
  3. Adapt the base functions to add the overall mean or median to the plot, when requested by parameter
  4. Facet the plot by locale, when requested by parameter

The first step is to combine one or more processed files, with a column added for locale:

# Combine files from a character list
combineProcessedFiles <- function(charList, mapper=cityNameMapper) {
    
    # Combine the objects represented by charList, and name the list items using charList
    listFiles <- lapply(charList, FUN=function(x) { get(x) })
    names(listFiles) <- charList
    
    # Bind rows, and add the descriptive locale name as sourceName
    tblFiles <- bind_rows(listFiles, .id="source") %>%
        mutate(sourceName=mapper[source])
    
    tblFiles
    
}
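
combineProcessedFiles() assumes a mapper such as cityNameMapper, which is defined elsewhere in the workflow. The sketch below is illustrative only, assuming the mapper is a named character vector keyed by "station_year" (the entries shown match the descriptions printed earlier):

```r
# Illustrative sketch only -- the real cityNameMapper is built elsewhere.
# Assumed shape: a named character vector keyed by "<station>_<year>"
cityNameMapper <- c(kdtw_2016 = "Detroit, MI (2016)",
                    kord_2016 = "Chicago, IL (2016)",
                    ksan_2016 = "San Diego, CA (2016)")

# Indexing by the source key returns the descriptive locale name,
# which is what mutate(sourceName=mapper[source]) does row-wise
unname(cityNameMapper["kdtw_2016"])
```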

The 2016 data will be used to run the combined process:

# Grab all the data that ends in _2016
locales2016 <- ls() %>%
    grep(pattern="_2016", value=TRUE)
cat("\nLocales used will be:\n\n", paste0(locales2016, collapse="\n"), "\n\n", sep="")
## 
## Locales used will be:
## 
## kdtw_2016
## kewr_2016
## kgrb_2016
## kgrr_2016
## kiah_2016
## kind_2016
## klas_2016
## klnk_2016
## kmke_2016
## kmsn_2016
## kmsp_2016
## kmsy_2016
## kord_2016
## ksan_2016
## ktvc_2016
# Combine the 2016 data
all2016Data <- combineProcessedFiles(locales2016)

# Show counts by sourceName
all2016Data %>%
    count(source, sourceName)
## # A tibble: 15 x 3
##    source    sourceName                   n
##    <chr>     <chr>                    <int>
##  1 kdtw_2016 Detroit, MI (2016)        8818
##  2 kewr_2016 Newark, NJ (2016)         8821
##  3 kgrb_2016 Green Bay, WI (2016)      8803
##  4 kgrr_2016 Grand Rapids, MI (2016)   8812
##  5 kiah_2016 Houston, TX (2016)        8816
##  6 kind_2016 Indianapolis, IN (2016)   8767
##  7 klas_2016 Las Vegas, NV (2016)      8818
##  8 klnk_2016 Lincoln, NE (2016)        8813
##  9 kmke_2016 Milwaukee, WI (2016)      8808
## 10 kmsn_2016 Madison, WI (2016)        8798
## 11 kmsp_2016 Minneapolis, MN (2016)    8817
## 12 kmsy_2016 New Orleans, LA (2016)    8813
## 13 kord_2016 Chicago, IL (2016)        8815
## 14 ksan_2016 San Diego, CA (2016)      8810
## 15 ktvc_2016 Traverse City, MI (2016)  8814

The following global summaries will be useful:

  • plotcountsByMetric() - overall percentage by metric, applied to total counts for locale
  • plotNumCor() - overall geom_smooth()
  • plotFactorNumeric() - overall mean/median of numeric by factor
  • corMETAR() and lmMETAR() - not applicable, though either could be run on the full dataset
  • basicWindPlots() - TBD; perhaps adapt along with consolidatePlotWind()
  • consolidatePlotWind() - TBD; perhaps adapt along with basicWindPlots()

Helper functions can be created for:

  • helperCountsByMetric() - get the overall percentage by metric for variable x
  • helperNumCor() - get an overall geom_smooth for variable y vs. variable x
  • helperFactorNumeric() - get the overall mean or median for numeric variable y by factor variable x

Each helper function is defined and demonstrated in turn below:

# Helper function to get overall percentage by metric for variable x
helperCountsByMetric <- function(tbl, ctVar, sumOn="dummyVar") {

    tbl %>%
        mutate(dummyVar=1) %>%
        select_at(vars(all_of(c(ctVar, sumOn)))) %>%
        filter_all(all_vars(!is.na(.))) %>%
        group_by_at(ctVar) %>%
        summarize(n=sum(get(sumOn))) %>%
        mutate(nPct=n/sum(n))
        
}

# Example run to get counts by greatest sky obscuration
helperCountsByMetric(all2016Data, ctVar="wType")
## # A tibble: 7 x 3
##   wType     n    nPct
##   <fct> <dbl>   <dbl>
## 1 VV      761 0.00576
## 2 OVC   39480 0.299  
## 3 BKN   31352 0.237  
## 4 SCT   16212 0.123  
## 5 FEW   19304 0.146  
## 6 CLR   24433 0.185  
## 7 Error   601 0.00455
# Helper function to get a geom_smooth for variable y vs variable x
helperNumCor <- function(tbl, 
                         xVar, 
                         yVar, 
                         sumOn="dummyVar",
                         se=TRUE, 
                         color="red", 
                         method="lm", 
                         lty=2
                         ) {
    
    # Generate the overall totals for sumOn by xVar and yVar
    plotData <- tbl %>%
        mutate(dummyVar=1) %>%
        select_at(vars(all_of(c(xVar, yVar, sumOn)))) %>%
        filter_all(all_vars(!is.na(.))) %>%
        group_by_at(vars(all_of(c(xVar, yVar)))) %>%
        summarize(nTotal=sum(get(sumOn)))
    
    geom_smooth(data=plotData, 
                aes_string(x=xVar, y=yVar, weight="nTotal"), 
                se=se, 
                color=color, 
                method=method, 
                lty=lty
                )
    
}

# Example run to get TempC vs DewC
helperNumCor(all2016Data, xVar="TempC", yVar="DewC")
## mapping: weight = ~nTotal, x = ~TempC, y = ~DewC 
## geom_smooth: na.rm = FALSE, se = TRUE
## stat_smooth: na.rm = FALSE, se = TRUE, method = lm, formula = y ~ x
## position_identity
# Example for using the helper function on a plot
plotNumCor(kdtw_2016, var1="TempC", var2="DewC") + 
    helperNumCor(all2016Data, xVar="TempC", yVar="DewC")

# Helper function to calculate .f(numVar) by byVar
helperFactorNumeric <- function(tbl, .f, byVar, numVar, ...) {
    
    tbl %>%
        select_at(vars(all_of(c(byVar, numVar)))) %>%
        filter_all(all_vars(!is.na(.))) %>%
        group_by_at(byVar) %>%
        summarize(helpFN=.f(get(numVar), ...))
    
}

# Example for getting median TempF by month
helperFactorNumeric(all2016Data, .f=median, byVar="month", numVar="TempF")
## # A tibble: 12 x 2
##    month helpFN
##    <fct>  <dbl>
##  1 Jan     30.9
##  2 Feb     35.1
##  3 Mar     46.9
##  4 Apr     52.0
##  5 May     64.0
##  6 Jun     73.0
##  7 Jul     77  
##  8 Aug     75.9
##  9 Sep     71.1
## 10 Oct     61.0
## 11 Nov     50  
## 12 Dec     34.0
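
The group-then-summarize pattern in helperFactorNumeric() mirrors base R's tapply(). A minimal self-contained check on toy (hypothetical) data:

```r
# Toy data (hypothetical), standing in for a processed METAR table
df <- data.frame(month = c("Jan", "Jan", "Feb", "Feb"),
                 TempF = c(30, 32, 35, 39))

# Base-R equivalent of group_by_at(byVar) %>% summarize(helpFN=.f(get(numVar)))
tapply(df$TempF, df$month, median)
# Feb: 37, Jan: 31
```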

The function plotcountsByMetric() has been updated above to allow for faceting and plotting of the overall central tendency. Two examples are shown: the base version from before, and a faceted example:

# Previous Example for Detroit 2016 - using WindDir, cType1, month, wType
plotcountsByMetric(kdtw_2016, 
                   mets=c("WindDir", "cType1", "month", "wType"), 
                   title="Detroit, MI (2016)", 
                   dropNA=TRUE, 
                   diagnose=TRUE
                   )
## 
## Dropping 1 rows with 49 observations due to NA

# Facetted example for kdtw_2016, kord_2016, klas_2016, ksan_2016
useData <- all2016Data %>%
    filter(source %in% c("kdtw_2016", "klas_2016", "kord_2016", "ksan_2016"))

plotcountsByMetric(useData, 
                   mets=c("WindDir", "cType1", "month", "wType"), 
                   title="Comparison Across Locales (red dots are the median)", 
                   dropNA=TRUE, 
                   diagnose=TRUE, 
                   facetOn="sourceName", 
                   showCentral=TRUE
                   )
## 
## Dropping 4 rows with 117 observations due to NA

As observed previously, Las Vegas tends towards southerly winds while San Diego tends towards northwesterly winds and calm (direction 000) winds. Detroit is most likely to be overcast, while Las Vegas is most likely to be clear.

The function plotNumCor() has been updated above to allow for faceting and plotting of the overall central tendency. Two examples are shown: the base version from before, and a faceted example:

# Example for Detroit 2016 - using TempC and DewC
plotNumCor(kdtw_2016, var1="TempC", var2="DewC", subT="Detroit, MI (2016)", diagnose=TRUE)
## 
## Dropping 1 rows with 49 observations due to NA

# Facetted example for kdtw_2016, kord_2016, klas_2016, ksan_2016
useData <- all2016Data %>%
    filter(source %in% c("kdtw_2016", "klas_2016", "kord_2016", "ksan_2016"))

# Facetted plot for very highly correlated variables TempC and TempF
plotNumCor(useData, 
           var1="TempC", 
           var2="TempF", 
           subT="Comparison Across Locales (red dashed lines are the overall)", 
           dropNA=TRUE, 
           diagnose=TRUE, 
           facetOn="sourceName", 
           showCentral=TRUE
           )
## 
## Dropping 4 rows with 117 observations due to NA

# Facetted plot for highly correlated variables TempF and DewF
plotNumCor(useData, 
           var1="TempF", 
           var2="DewF", 
           subT="Comparison Across Locales (red dashed lines are the overall)", 
           dropNA=TRUE, 
           diagnose=TRUE, 
           facetOn="sourceName", 
           showCentral=TRUE
           )
## 
## Dropping 4 rows with 117 observations due to NA

The differences in the relationships between temperature and dew point stand out:

  • Chicago and Detroit have a wide spread of both temperature and dew point, and they tend to rise and fall together
  • San Diego has a narrow spread of both temperature and dew point, and they have a lesser tendency to rise and fall together
  • Las Vegas has very little change in average dew point even as temperatures range from 30-100 degrees F

In contrast, of course, temperatures measured in C and F all follow the same pattern regardless of city.

The function plotFactorNumeric() has been updated above to allow for faceting and plotting of the overall central tendency. Two examples are shown: the base version from before, and a faceted example:

# Example for Detroit 2016 - using TempF and month
plotFactorNumeric(kdtw_2016, 
                  fctVar="month", 
                  numVar="TempF", 
                  subT="Detroit, MI (2016)", 
                  showXLabel=FALSE,
                  diagnose=TRUE
                  )
## 
## Removing 49 records due to NA

# Facetted example for kdtw_2016, kord_2016, klas_2016, ksan_2016
useData <- all2016Data %>%
    filter(source %in% c("kdtw_2016", "klas_2016", "kord_2016", "ksan_2016"))

plotFactorNumeric(useData, 
                  fctVar="month", 
                  numVar="TempF", 
                  subT="Overall median shown as red dot", 
                  showXLabel=FALSE,
                  diagnose=TRUE,
                  facetOn="sourceName", 
                  showCentral=TRUE
                  )
## 
## Removing 117 records due to NA

Las Vegas consistently runs above the overall median temperature, while Chicago and Detroit run below the overall median temperature, particularly during the cold season. San Diego has little temperature variation by month and is thus below the median in the warm season and above the median in the cold season.

It is now possible to re-run the EDA plotting routines, focused on the 2016 data:

# Create rounded TempF and DewF in all2016Data
all2016Data <- all2016Data %>%
    mutate(TempF5=5*round(round(TempF)/5), 
           DewF5=5*round(round(DewF)/5), 
           WindSpeed5=5*round(WindSpeed/5),
           Altimeter10=round(Altimeter, 1)
           )
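
The nested rounding above bins each value to the nearest multiple of 5 (round to the nearest whole degree, then to the nearest multiple of 5). A quick self-contained check:

```r
# Round to the nearest whole degree, then to the nearest multiple of 5
TempF <- c(71.4, 72.6, 78.1)
5 * round(round(TempF) / 5)
# 70 75 80
```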

# Counts by Metric for all 2016 data
plotcountsByMetric(all2016Data, 
                   mets=c("month", "year",
                          "WindDir", "WindSpeed5", 
                          "Visibility", "Altimeter10",
                          "TempF5", "DewF5", 
                          "wType"
                          ), 
                   title="Comparisons Across Locales (red dots are the median)", 
                   facetOn="sourceName",
                   showCentral=TRUE
                   )

The cross-locale comparisons bring out a few salient features:

DATA VOLUMES:
* All locales have roughly the same amount of data by year and month, focused on 2016 and with 1-2 days on either side

WIND DIRECTION and WIND SPEED:
* Las Vegas has an excess of no/variable wind and of southerly winds, both appropriate for a desert
* Houston has an excess of no wind and southerly winds, both appropriate for the Gulf Coast
* San Diego has an excess of no wind and of northwesterly wind, both appropriate for the Pacific coast
* Chicago, Grand Rapids, Indianapolis, Minneapolis, and Newark all have lower occurrences of no wind, appropriate for relatively cold mid-latitude cities
* Lincoln looks “about normal”, with the exception that it has slightly more southerly winds; Lincoln is the only Great Plains locale in the analysis, and it appears to be a blend of Gulf Coast and wintry mid-latitude patterns
* Detroit, Green Bay, Milwaukee, and New Orleans all look “about average”; this is not surprising in the first three cases given the predominance of cold, mid-latitude locales, but is surprising for New Orleans
* Madison and Traverse City are surprising in that both show a predominance of no/variable winds; this is uncommon for cold-weather cities in the mid-latitude and merits further examination

VISIBILITY:
* The overwhelming majority of visibilities are 10 SM (the highest value recorded; effectively unlimited in a METAR)
* There is a data issue with a Visibility > 10 that should be addressed
* Las Vegas is slightly more likely than most to have unlimited visibility, and Detroit is slightly less likely than most

ALTIMETER:
* Las Vegas skews low as appropriate for a high-altitude desert locale
* New Orleans and Houston show less variance, perhaps driven by being roughly at sea level and in close proximity to the Gulf of Mexico
* San Diego shows very low variance, perhaps driven by being roughly at sea level and in close proximity to the Pacific Ocean

TEMPERATURE:
* Houston, Las Vegas, and New Orleans skew warm as expected
* San Diego has very low variance as expected
* At a gross level, the other cities look similar to the median, likely driven by the predominance of cold, mid-latitude locales in the data file

DEW POINT:
* Houston and New Orleans skew very high as expected
* Las Vegas skews very low as expected
* San Diego has very low variance as expected

SKY OBSCURATION:
* Lincoln and Green Bay are the most likely to be CLR (clear, no clouds detected by the automated sensor). This may be driven by a difference in maximum sensor heights, and is unexpected for Green Bay, which should be frequently cloudy given its latitude and proximity to a large body of water
* Detroit, Traverse City, Grand Rapids, and Minneapolis are especially likely to be overcast
* Las Vegas is especially likely to have few clouds or to be clear

Comparisons are run for a few of the numerical correlations:

# Example for 2016 - using mixes of WindSpeed, Altimeter, TempF, DewF, TempC, DewC
numCorList <- list(c("TempC", "TempF"), 
                   c("DewC", "DewF"), 
                   c("TempF", "DewF"), 
                   c("Altimeter", "WindSpeed"), 
                   c("Altimeter", "TempF")
                   )

# Run the list through plotNumCor()
for (x in numCorList) {
    plotNumCor(all2016Data, 
               var1=x[1], 
               var2=x[2], 
               subT="Red dashed line is the overall slope", 
               diagnose=TRUE, 
               facetOn="sourceName", 
               showCentral=TRUE
               )
}
## 
## Dropping 15 rows with 593 observations due to NA

## 
## Dropping 15 rows with 593 observations due to NA

## 
## Dropping 15 rows with 593 observations due to NA

## 
## Dropping 15 rows with 593 observations due to NA

## 
## Dropping 15 rows with 593 observations due to NA

The cross-locale comparisons bring out a few salient features:

FAHRENHEIT AND CELSIUS:
* As expected, TempF/C are perfectly correlated and DewF/C are perfectly correlated. Since the observations were taken in the US, the TempF/DewF data will be used (TempC/DewC are conversions from the measured TempF/DewF to match the international standard for METAR reporting)

TEMPERATURE AND DEW POINT:
* While many cities have different clusters of temperature/dew point, all but Las Vegas and San Diego follow a pattern where the temperature and the dew point tend to run together at a similar rate
* In Las Vegas, the dew point is largely independent of the temperature
* In San Diego, there is less variance in temperature and dew point, and a lower (but still obvious) tendency for temperature and dew point to rise/fall together

ALTIMETER AND WIND SPEED:
* As expected, when the altimeter rises, on average, the wind speed falls
* This tendency is less pronounced in Houston and New Orleans, and more pronounced in Las Vegas, San Diego, and Indianapolis

ALTIMETER AND TEMPERATURE:
* Overall, low temperatures and high altimeters tend to be observed together
* This is especially so in Houston, Las Vegas, and New Orleans
* This is more modest in Grand Rapids, Green Bay, and Traverse City

Comparisons are then run for numeric variables against factor variables:

# Bin WindDir into the compass points N, NE, E, SE, S, SW, W, NW, plus 000, VRB, and Error
mod2016Data <- all2016Data %>%
    mutate(tempDir=ifelse(is.na(WindDir) | WindDir %in% c("000", "VRB"), -1, as.numeric(WindDir)),
           predomDir=factor(case_when(is.na(WindDir) ~ "Error", 
                                      WindDir=="000" ~ "000", 
                                      WindDir=="VRB" ~ "VRB", 
                                      tempDir >= 337.5 ~ "N", 
                                      tempDir <= 22.5 ~ "N",
                                      tempDir <= 67.5 ~ "NE", 
                                      tempDir <= 112.5 ~ "E", 
                                      tempDir <= 157.5 ~ "SE", 
                                      tempDir <= 202.5 ~ "S", 
                                      tempDir <= 247.5 ~ "SW", 
                                      tempDir <= 292.5 ~ "W", 
                                      tempDir <= 337.5 ~ "NW",
                                      TRUE ~ "Error"
                                      ), 
                            levels=c("Error", "000", "VRB", "NE", "E", "SE", "S", "SW", "W", "NW", "N")
                            )
           )
## Warning in ifelse(is.na(WindDir) | WindDir %in% c("000", "VRB"), -1,
## as.numeric(WindDir)): NAs introduced by coercion
# Key factor variables include month, wType, predomDir
# Key numeric variables include WindSpeed, Altimeter, TempF, DewF, Visibility
fctNumList <- list(c("month", "WindSpeed"), 
                   c("month", "Altimeter"), 
                   c("month", "TempF"), 
                   c("month", "DewF"), 
                   c("month", "Visibility"),
                   c("wType", "WindSpeed"),
                   c("wType", "Altimeter"),
                   c("wType", "Visibility"),
                   c("predomDir", "WindSpeed"),
                   c("predomDir", "Altimeter"),
                   c("predomDir", "TempF"),
                   c("predomDir", "DewF")
                   )

for (x in fctNumList) {
    plotFactorNumeric(mod2016Data, 
                      fctVar=x[1], 
                      numVar=x[2], 
                      subT="Red dots are the overall average", 
                      showXLabel=FALSE,
                      diagnose=TRUE, 
                      facetOn="sourceName",
                      showCentral=TRUE
                      )
}
## 
## Removing 593 records due to NA

(the same message is printed once per plot; 12 plots in total)

The cross-locale comparisons bring out a few salient points:

WIND SPEED BY MONTH:
* The plot is too busy; need to rethink, potentially by removing the outliers and plotting only the box

ALTIMETER BY MONTH:
* Same as above; this plot is too busy

TEMPERATURE BY MONTH:
* Seasonal patterns are observed in most of the data, with a warm season centered around July and a cold season centered around December
* Las Vegas stands out for running warmer than the overall average in every month
* Houston, New Orleans, and San Diego stand out for running warmer than the overall average in the cold season and similar to the overall average in the warm season

DEW POINT BY MONTH:
* Seasonal patterns are observed in most of the data, with the humid seasons tracking with the warm seasons
* Houston and New Orleans run consistently above the average dew point by month
* San Diego runs above the average dew point during the cold season
* Las Vegas runs below the average dew point during the warm season

VISIBILITY BY MONTH:
* Visibilities are overwhelmingly likely to be 10 SM
* The Newark, NJ outlier described earlier (19 SM) should be deleted
* Detroit is especially likely to have Visibility less than 10 SM
* Grand Rapids and Traverse City have meaningful occurrences of Visibility less than 10 SM during the cold season

WIND SPEED BY SKY OBSCURATION:
* This chart is too busy as per above

ALTIMETER BY SKY OBSCURATION:
* Not much at a glance

VISIBILITY BY SKY OBSCURATION:
* The VV and OVC sky obscurations are most associated with low visibilities; VV in particular is almost always associated with very low visibility

WIND SPEED BY WIND DIRECTION:
* Wind direction “000” is always associated with wind speed 0, as expected
* Wind direction “VRB” is always associated with a low but non-zero wind speed, as expected
* While some cities are windier than others, there is no pronounced tendency for wind speed to be highly associated with a given wind direction in any locale

ALTIMETER BY WIND DIRECTION:
* No gross trends observed
* The plot is rather busy, especially given the low variance in median/IQR for altimeter relative to the outlier points

TEMPERATURE/DEW POINT BY WIND DIRECTION:
* Plot is not great for reading and interpreting

Next steps are to modify a few of the plots for better interpretability (less busy, more variation of the core data metric relative to the full y-axis, etc.), investigate and rectify the data issues observed, and save a version of the file for further analysis.

A modified boxplot function is created to plot only the median and the IQR, with the goal of having more of the variance in the median visible on the plot:

plotMedianIQR <- function(met, 
                          fctVar, 
                          numVar, 
                          title=NULL, 
                          subT="", 
                          mid=0.5,
                          rng=c(0.25, 0.75),
                          diagnose=TRUE,
                          showXLabel=TRUE,
                          mapper=varMapper,
                          facetOn=NULL, 
                          showCentral=FALSE, 
                          ylimits=NULL
                          ) {
    
    # Function arguments
    # met: dataframe or tibble containing raw data
    # fctVar: character vector of variable to be used for the x-axis (factor in the boxplot)
    # numVar: character vector of variable to be used for the y-axis (numeric in the boxplot)
    # title: character vector for plot title
    # subT: character vector for plot subtitle
    # mid: float between 0 and 1 for the quantile to be used as the midpoint
    # rng: length-two float vector for (lo, hi) to be used as the dimensions of the box
    # diagnose: boolean for whether to note in the log the number of NA observations dropped
    # showXLabel: boolean for whether to include the x-label (e.g., set to FALSE if using 'month')
    # mapper: named list containing mapping from variable name to well-formatted name for titles and axes
    # facetOn: a faceting variable for the supplied met (NULL for no faceting)
    # showCentral: boolean for whether to show the central tendency over-plotted on the main data
    # ylimits: length-two numeric for the y-axis minimum and maximum (default NULL uses plot defaults)
    
    # Function usage
    # 1.  By default, the function creates a modified boxplot of numVar by fctVar - line at mid, box going from rng[1] to rng[2]
    # 2.  If facetOn is passed as a non-NULL, then the data in #1 will be faceted by facetOn
    # 3.  If showCentral=TRUE, then the overall median of numVar by fctVar will be plotted as a red dot

    
    # Check that the quantile variables are sensible
    if (length(mid) != 1 | length(rng) != 2) {
        stop("Must pass a single value as mid and a length-two vector as rng\n")    
    }
    if (min(c(mid, rng)) < 0 | max(c(mid, rng)) > 1) {
        stop("All values of mid and rng must be between 0 and 1, inclusive\n")
    }
    if ((mid < rng[1]) | (mid > rng[2])) {
        stop("mid must be at least as big as rng[1] and no greater than rng[2]\n")
    }
    quants <- paste0(round(100*c(rng[1], mid, rng[2]), 0), "%", collapse=" ")
    
    # Create the title if not passed
    if (is.null(title)) { 
        title <- paste0("Hourly Observations of ", mapper[numVar], " by ", mapper[fctVar])
    }
    
    # Remove the NA variables
    nOrig <- nrow(met)
    dat <- met %>%
        filter(!is.na(get(fctVar)), !is.na(get(numVar)))
    if (diagnose) { cat("\nRemoving", nOrig-nrow(dat), "records due to NA\n") }
    
    # Create the quantile data by fctVar and (if passed) facetOn
    groupVars <- fctVar 
    if (!is.null(facetOn)) { groupVars <- c(groupVars, facetOn) }
    plotData <- dat %>%
        group_by_at(groupVars) %>%
        summarize(midPoint=quantile(get(numVar), probs=mid), 
                  loPoint=quantile(get(numVar), probs=rng[1]),
                  hiPoint=quantile(get(numVar), probs=rng[2])
                  )
    
    # Create the base plot
    p <- plotData %>%
        ggplot(aes_string(x=fctVar, y="midPoint")) +
        geom_crossbar(aes(ymin=loPoint, ymax=hiPoint), fill="lightblue") +
        labs(title=title,
             subtitle=subT,
             x=ifelse(showXLabel, paste0(mapper[fctVar], " - ", fctVar), ""),
             y=paste0(mapper[numVar], " - ", numVar), 
             caption=paste0("Quantiles plotted: ", quants)
             )
    
    # If faceting has been requested, facet by the desired variable
    if (!is.null(facetOn)) {
        p <- p + facet_wrap(as.formula(paste("~", facetOn)))
    }
    
    # If showCentral=TRUE, add a dot plot for the overall value of 'mid'
    if (showCentral) {
        centData <- helperFactorNumeric(dat, .f=quantile, byVar=fctVar, numVar=numVar, probs=mid)
        p <- p + geom_point(data=centData, aes(y=helpFN), size=2, color="red")
    }
    
    # If ylim has been passed, use it
    if (!is.null(ylimits)) {
        p <- p + ylim(ylimits)
    }

    # Render the final plot
    print(p)
    
}

The function can then be applied in an attempt to get a better look at a few of the comparisons:

# Key factor variables include month, wType, predomDir
# Key numeric variables include WindSpeed, Altimeter, TempF, DewF, Visibility
fctNumListIQR <- list(c("month", "WindSpeed"), 
                      c("month", "Altimeter"), 
                      c("wType", "WindSpeed"),
                      c("predomDir", "Altimeter"),
                      c("predomDir", "TempF"),
                      c("predomDir", "DewF")
                      )

for (x in fctNumListIQR) {
    plotMedianIQR(mod2016Data, 
                  fctVar=x[1], 
                  numVar=x[2], 
                  subT="Red dots are the overall mid-quantile", 
                  showXLabel=FALSE,
                  diagnose=TRUE, 
                  facetOn="sourceName",
                  showCentral=TRUE
                  )
}
## 
## Removing 593 records due to NA

(the same message is printed once per plot; 6 plots in total)

Limiting the observations to Q1-Median-Q3 brings out a bit more information in the plots:

WIND SPEED BY MONTH:
* San Diego and Las Vegas are meaningfully less windy than the median, especially during the cold season

ALTIMETER BY MONTH:
* San Diego has very little variance in altimeter relative to the other locales
* Las Vegas runs especially low on altimeter during the warm season
* There is a seasonal pattern where altimeters tend to run lower during the warm season while also showing a much smaller Q1-Q3 range

WIND SPEED BY SKY OBSCURATION:
* Newark shows very high wind speeds when obscuration is VV
* Should re-run with the y-axis capped at 15 knots to visualize the other locales

ALTIMETER BY WIND DIRECTION:
* Altimeters tend to be a bit lower when winds are from the S-SW-W-NW, a pattern that seems consistent across all locales
* Las Vegas runs especially low on altimeter relative to other locales when winds are from E-SE-S

TEMPERATURE BY WIND DIRECTION:
* Variable winds are associated with higher temperatures than calm (speed 0) winds
* Winds from W-NW are generally associated with somewhat lower temperatures

DEW POINT BY WIND DIRECTION:
* Winds from SW-W-NW-N are associated with generally lower dew points than winds from NE-E-SE-S
* Las Vegas typically runs low on dew points while Houston and New Orleans typically run high on dew points
* San Diego has little variance in dew points and does not see any dip when winds are from SW-W-NW-N

The wind speed vs. sky obscuration plot is re-run with y-limits that suppress Newark VV:

plotMedianIQR(mod2016Data, 
              fctVar="wType", 
              numVar="WindSpeed", 
              subT="Red dots are the overall mid-quantile", 
              showXLabel=FALSE,
              diagnose=TRUE, 
              facetOn="sourceName",
              showCentral=TRUE, 
              ylimits=c(0, 15)
              )
## 
## Removing 593 records due to NA
## Warning: Removed 1 rows containing missing values (geom_crossbar).

Low vertical visibilities (VV) are generally associated with lower winds. There is also a slight tendency for more obscured skies (OVC) to be associated with slightly stronger winds.

Further, the previous analysis for counts by wind direction is re-run using the predominant directions:

# Counts by Metric for predomDir using mod2016Data
plotcountsByMetric(mod2016Data, 
                   mets=c("predomDir"), 
                   title="Comparisons Across Locales (red dots are the median)", 
                   facetOn="sourceName",
                   showCentral=TRUE
                   )

Findings include:
* Across all locales, winds are more commonly from S-SW-W-NW-N than from NE-E-SE
* San Diego is especially likely to have either 0 wind or wind from W-NW
* Traverse City is especially likely to have 0 wind or variable wind (very surprising, and possibly a sign of anomalous data)
* Minneapolis is especially likely to experience winds from NE
* New Orleans and Houston rarely experience winds from SW-W-NW relative to other locales
* Lincoln and Las Vegas are especially prone to winds from S (and for Las Vegas SW)

There are several issues identified that should be explored and fixed if appropriate:

  • Visibility greater than 10 SM recorded
  • Tendency for maximum obscuration to be CLR in Lincoln and Green Bay
  • Tendency for Traverse City to have no or variable winds

Cloud Data Exploration

Cloud exploration can highlight whether all locales are being compared apples to apples. This is especially the case if some locales have a different maximum sensor height:

# Select the source, sourceName, dtime and cLevel variables; pivot cLevel down
cLevels <- mod2016Data %>%
    select(source, sourceName, dtime, starts_with("cLevel")) %>%
    pivot_longer(-c(source, sourceName, dtime), names_to="level", values_to="height") %>%
    mutate(level=as.integer(str_replace(level, pattern="cLevel", replacement="")))

# Select the source, sourceName, dtime and cType variables; pivot cType down
cTypes <- mod2016Data %>%
    select(source, sourceName, dtime, starts_with("cType")) %>%
    pivot_longer(-c(source, sourceName, dtime), names_to="level", values_to="type") %>%
    mutate(level=as.integer(str_replace(level, pattern="cType", replacement="")))

cData <- cLevels %>%
    inner_join(cTypes, by=c("source", "sourceName", "dtime", "level"))

# Plot cloud heights, using only non-NA
cData %>%
    filter(!is.na(height)) %>%
    ggplot(aes(x=fct_reorder(sourceName, height, .fun=max, na.rm=TRUE), y=height)) + 
    geom_violin(fill="lightblue") + 
    coord_flip() + 
    labs(x="", y="Cloud Height (feet)", title="Density of cloud heights by locale")

There are clearly differences in maximum cloud height recorded by locale:

  • 4 locales record clouds up to 35,000 feet
  • 5 locales record clouds up to 30,000 feet; there appear to be few if any clouds recorded above 30,000 feet, so cloud distributions may not be meaningfully impacted by this
  • 3 locales record clouds up to 25,000 feet; there are meaningful clouds between 25,000 and 30,000 feet, so these locales may appear to be “more clear than normal” purely due to maximum recorded height
  • 2 locales record heights up to 12,000 feet (Lincoln and Green Bay); these are the cities that appeared anomalously clear in the EDA, and the reason is that they exclude mid/high-level cloudiness

The distribution of cloud types observed can also be assessed:

# Plot cloud types, using only non-"" type
fctLayers <- c("VV", "OVC", "BKN", "SCT", "FEW")

cData %>%
    filter(type!="") %>%
    mutate(type=factor(type, levels=fctLayers)) %>%
    ggplot(aes(x=fct_reorder(sourceName, height, .fun=max, na.rm=TRUE), fill=type)) + 
    geom_bar(position="stack") + 
    coord_flip() + 
    labs(x="", y="Cloud Layer Obscuration", title="Cloud obscuration by locale") + 
    scale_fill_discrete("", rev(fctLayers)) + 
    theme(legend.position="bottom")

Lincoln and Green Bay stand out for having fewer clouds, likely due to the inability to capture higher altocumulus and cirrus clouds (both very common in the mid-latitudes).

Supposing that only clouds of 12,000 feet and under are considered:

# Plot cloud heights, using only non-NA
cData %>%
    filter(!is.na(height)) %>%
    filter(height <= 12000) %>%
    ggplot(aes(x=sourceName, y=height)) + 
    geom_violin(fill="lightblue") + 
    coord_flip() + 
    labs(x="", y="Cloud Height (feet)", title="Density of cloud heights by locale")

cData %>%
    filter(type!="") %>%
    filter(height <= 12000) %>%
    mutate(type=factor(type, levels=fctLayers)) %>%
    ggplot(aes(x=fct_reorder(sourceName, sourceName, .fun=length), fill=type)) + 
    geom_bar(position="stack") + 
    coord_flip() + 
    labs(x="", y="Cloud Layer Obscuration", title="Cloud obscuration by locale (up to 12,000 feet)") + 
    scale_fill_discrete("", rev(fctLayers)) + 
    theme(legend.position="bottom")

The patterns are much more plausible:

  • San Diego, known for marine layer, has clouds that skew very low and more obscured
  • Las Vegas, a desert, has clouds that skew higher and less obscured
  • The remaining cities have maximum cloud densities in the 2500-5000 foot range, very common cloud heights in the US
  • Lincoln and Green Bay remain among the least cloudy locales, though this may be due to hitting OVC (cloud layers above an OVC layer are not observed by the sensor/human and thus not recorded)

Further investigation of the cloud data may be interesting.

A function is written to keep only cloud data up to and including height x, and to add a base layer of “clear” clouds at an out-of-interval height:

# Filter to only clouds up to and including height
cloudsLevel0 <- function(df, 
                         maxHeight, 
                         byVars,
                         baseLevel=0,
                         heightBase=-100,
                         typeBase="CLR"
                         ) {
    
    # Function assumptions
    # Input data are unique by byVars-level and with columns 'height' and 'type'
    # Clouds increase in height with level
    # Clouds are non-decreasing in type (VV > OVC > BKN > SCT > FEW) with level
    
    # FUNCTION ARGUMENTS:
    # df: tibble or dataframe containing the clouds data
    # maxHeight: the maximum height to consider (delete all heights above this level)
    # byVars: the variables that make up a unique observation (df should be unique by byVars-level)
    # baseLevel: the level to be created as the base level (by default, level 0)
    # heightBase: the height to be provided to the base level (by default, -100 feet)
    # typeBase: the type of obscuration observed at the base level (by default, CLR)
    
    # Add a cloud level 0 that has height -100 (by default)
    # Include only levels where the cloud height is not NA
    # Include only levels where the cloud height is less than or equal to maxHeight
    modData <- df %>% 
        group_by_at(vars(all_of(byVars))) %>% 
        summarize(level=baseLevel, height=heightBase, type=typeBase) %>% 
        ungroup() %>% 
        bind_rows(df) %>% 
        arrange_at(vars(all_of(c(byVars, "level")))) %>% 
        filter(!is.na(height)) %>% 
        filter(height <= maxHeight)
    
    modData
    
}

The function is then run using the existing cData:

modCData <- cloudsLevel0(cData, maxHeight=12000, byVars=c("source", "sourceName", "dtime"))
modCData <- modCData %>%
    mutate(type=factor(type, levels=c("VV", "OVC", "BKN", "SCT", "FEW", "CLR")))
modCData
## # A tibble: 260,324 x 6
##    source    sourceName         dtime               level height type 
##    <chr>     <chr>              <dttm>              <dbl>  <dbl> <fct>
##  1 kdtw_2016 Detroit, MI (2016) 2015-12-31 00:53:00     0   -100 CLR  
##  2 kdtw_2016 Detroit, MI (2016) 2015-12-31 00:53:00     1   2500 BKN  
##  3 kdtw_2016 Detroit, MI (2016) 2015-12-31 00:53:00     2   5000 OVC  
##  4 kdtw_2016 Detroit, MI (2016) 2015-12-31 01:53:00     0   -100 CLR  
##  5 kdtw_2016 Detroit, MI (2016) 2015-12-31 01:53:00     1   2100 OVC  
##  6 kdtw_2016 Detroit, MI (2016) 2015-12-31 02:53:00     0   -100 CLR  
##  7 kdtw_2016 Detroit, MI (2016) 2015-12-31 02:53:00     1   2100 OVC  
##  8 kdtw_2016 Detroit, MI (2016) 2015-12-31 03:53:00     0   -100 CLR  
##  9 kdtw_2016 Detroit, MI (2016) 2015-12-31 03:53:00     1   2500 OVC  
## 10 kdtw_2016 Detroit, MI (2016) 2015-12-31 04:53:00     0   -100 CLR  
## # ... with 260,314 more rows

An additional function is written to take processed cloud data and to designate 1) the minimum cloud height, 2) the minimum cloud ceiling (a ceiling exists with BKN, OVC, or VV layers), and 3) the maximum obscuration level (using CLR if no clouds exist in that area):
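Before the implementation, the extraction logic can be illustrated on a single hypothetical observation in base R (the real functions below operate group-wise over the full tibble; the layer heights and types here are made up):

```r
# One hypothetical hourly observation: layers ordered by increasing height
heights <- c(1200, 4000, 9000)
types   <- c("FEW", "BKN", "OVC")

# 1) Minimum cloud height: the lowest layer of any type
minCloud <- heights[1]

# 2) Minimum ceiling: the lowest layer whose type constitutes a ceiling
ceilIdx <- which(types %in% c("VV", "OVC", "BKN"))[1]
minCeiling <- if (is.na(ceilIdx)) -100 else heights[ceilIdx]

# 3) Maximum obscuration: the type at the highest level (CLR if no layers)
maxObsc <- if (length(types) == 0) "CLR" else types[length(types)]
```

For this observation the minimum cloud height is the FEW layer at 1,200 feet, while the ceiling is the BKN layer at 4,000 feet and the maximum obscuration is OVC.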

# Helper function to pull out the minimum cloud height, limited to certain obscurations
getMinimumHeight <- function(df, 
                             byVars, 
                             types, 
                             baseLevel=0
                             ) {
    
    # Split the data into the baseLevel and all other levels
    baseData <- df %>%
        filter(level==baseLevel)
    layerData <- df %>%
        filter(level!=baseLevel)
        
    # Take the layerData, limit to type in types, and find the minimum level
    layerData <- layerData %>%
        filter(type %in% types) %>%
        group_by_at(vars(all_of(byVars))) %>%
        filter(level==min(level)) %>%
        ungroup()
    
    # Put the data back together
    # Keep the maximum level for each set of byVars (will be 0 if no data for byVars in layerData)
    cloudData <- baseData %>%
        bind_rows(layerData) %>%
        arrange_at(vars(all_of(c(byVars, "level")))) %>%
        group_by_at(vars(all_of(byVars))) %>%
        filter(level==max(level))
    
}

# Extract the minimum cloud height, minimum ceiling height, and maximum obscuration
hgtCeilObsc <- function(df, 
                        byVars,
                        baseLevel=0
                        ) {

    # Function assumptions
    # Input data are unique by byVars-level and with columns 'height' and 'type'
    # For each byVars, a row with level=baseLevel, height=heightBase, type=typeBase has been created
    # Clouds increase in height with level
    # Clouds are non-decreasing in type (VV > OVC > BKN > SCT > FEW) with level
    
    # FUNCTION ARGUMENTS:
    # df: tibble or dataframe containing the clouds data
    # byVars: the variables that make up a unique observation (df should be unique by byVars-level)
    # baseLevel: the base level in df (by default, level 0)
    # Note: the base-level height (-100 feet) and type (CLR) are not arguments here; they
    # are assumed to have been set already (e.g., by cloudsLevel0)
    
    # Get the maximum obscuration
    maxObsc <- df %>%
        group_by_at(vars(all_of(byVars))) %>% 
        filter(level==max(level)) %>%
        ungroup()
    
    # Get the minimum height (any type)
    minHeight <- getMinimumHeight(df, 
                                  byVars=byVars, 
                                  types=c("VV", "OVC", "BKN", "SCT", "FEW"), 
                                  baseLevel=baseLevel
                                  )
    
    # Get the minimum ceiling height (VV, OVC, BKN)
    minCeiling <- getMinimumHeight(df, 
                                   byVars=byVars, 
                                   types=c("VV", "OVC", "BKN"), 
                                   baseLevel=baseLevel
                                   )

    # Put the file together
    minCeiling <- minCeiling %>%
        rename(ceilingHeight=height, ceilingType=type, ceilingLevel=level)
    minHeight <- minHeight %>%
        rename(cloudHeight=height, cloudType=type, cloudLevel=level)
    maxObsc <- maxObsc %>%
        rename(obscHeight=height, obscType=type, obscLevel=level)
    
    # Merge
    cloudSummary <- maxObsc %>%
        full_join(minHeight, by=byVars) %>%
        full_join(minCeiling, by=byVars)
    
    cloudSummary
    
}

The function can then be run on the modified clouds data:

# Get the key clouds data
cloudSummary <- hgtCeilObsc(modCData, byVars=c("source", "sourceName", "dtime"))

# Check for consistency
cloudSummary %>%
    count(ceilingType, obscType)
## # A tibble: 7 x 3
##   ceilingType obscType     n
##   <fct>       <fct>    <int>
## 1 VV          VV         761
## 2 OVC         OVC      22654
## 3 BKN         OVC      10054
## 4 BKN         BKN      17198
## 5 CLR         SCT      12456
## 6 CLR         FEW      22711
## 7 CLR         CLR      46309
cloudSummary %>%
    count(cloudType, ceilingType)
## # A tibble: 10 x 3
##    cloudType ceilingType     n
##    <fct>     <fct>       <int>
##  1 VV        VV            761
##  2 OVC       OVC         17900
##  3 BKN       BKN         15436
##  4 SCT       OVC          1848
##  5 SCT       BKN          4936
##  6 SCT       CLR          8243
##  7 FEW       OVC          2906
##  8 FEW       BKN          6880
##  9 FEW       CLR         26924
## 10 CLR       CLR         46309
cloudSummary %>%
    count(cloudType, obscType)
## # A tibble: 12 x 3
##    cloudType obscType     n
##    <fct>     <fct>    <int>
##  1 VV        VV         761
##  2 OVC       OVC      17900
##  3 BKN       OVC       6384
##  4 BKN       BKN       9052
##  5 SCT       OVC       3444
##  6 SCT       BKN       3340
##  7 SCT       SCT       8243
##  8 FEW       OVC       4980
##  9 FEW       BKN       4806
## 10 FEW       SCT       4213
## 11 FEW       FEW      22711
## 12 CLR       CLR      46309
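These cross-tabs can be checked mechanically: wherever a ceiling exists, the maximum obscuration should itself be a ceiling type, and where none exists it should be SCT or better. A base-R sketch over the ceilingType/obscType combinations printed in the first cross-tab above:

```r
# ceilingType/obscType combinations copied from the first cross-tab above
ceilingType <- c("VV", "OVC", "BKN", "BKN", "CLR", "CLR", "CLR")
obscType    <- c("VV", "OVC", "OVC", "BKN", "SCT", "FEW", "CLR")

ceilingLayers <- c("VV", "OVC", "BKN")
hasCeiling <- ceilingType %in% ceilingLayers

# A ceiling implies a BKN-or-worse maximum obscuration;
# no ceiling implies a maximum obscuration of SCT, FEW, or CLR
stopifnot(all(obscType[hasCeiling] %in% ceilingLayers),
          all(obscType[!hasCeiling] %in% c("SCT", "FEW", "CLR")))
```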

Plots for the maximum obscuration (through 12,000 feet) can then be created:

plotMaxObsc <- function(df, 
                        xVar, 
                        fillVar, 
                        title, 
                        subtitle="Up to and including 12,000 feet",
                        orderByVariable=NULL,
                        orderByValue=NULL,
                        posnBar="stack",
                        yLabel="# Hourly Observations",
                        legendLabel="",
                        facetOn=NULL
                        ) {

    # Get the levels to be used
    cLevels <- levels(df %>% pull(fillVar))
    
    # Create the main plot
    p1 <- df %>%
        ggplot(aes_string(fill=fillVar))
    if (!is.null(orderByVariable)) {
        p1 <- p1 + 
            geom_bar(aes(x=fct_reorder(get(xVar), get(orderByVariable)==orderByValue, .fun=sum)),
                     position=posnBar
                     )
    } else {
        p1 <- p1 + 
            geom_bar(aes_string(x=xVar), position=posnBar)
    }
    p1 <- p1 + 
        coord_flip() + 
        labs(x="", 
             y=yLabel, 
             title=title, 
             subtitle=subtitle
             ) + 
        theme(legend.position="bottom") + 
        scale_fill_discrete(legendLabel, rev(cLevels)) + 
        guides(fill=guide_legend(nrow=1))
    if (!is.null(facetOn)) {
        p1 <- p1 + facet_wrap(as.formula(paste("~", facetOn)))
    }
    print(p1)
    
}

# Cloud obscuration by source
plotMaxObsc(cloudSummary, 
            xVar="sourceName", 
            fillVar="obscType", 
            title="Maximum Cloud Obscuration", 
            orderByVariable="obscType",
            orderByValue="CLR"
            )

# Cloud obscuration by month
cloudSummary <- cloudSummary %>%
    mutate(month=lubridate::month(dtime), 
           hour=lubridate::hour(dtime), 
           monthfct=factor(month.abb[month], levels=month.abb[1:12])
           )
plotMaxObsc(cloudSummary, 
            xVar="monthfct", 
            fillVar="obscType", 
            title="Maximum Cloud Obscuration", 
            posnBar="fill"
            )

A few salient observations stand out about the maximum obscuration level:

  • Las Vegas is almost always clear or has just a few clouds up to 12,000 feet
  • Lincoln and Green Bay tend to be either clear up to 12,000 feet or to have an overcast
  • Traverse City is especially prone to being overcast and not having clear skies to 12,000 feet
  • Clear skies to 12,000 feet are roughly as likely in any month, but overcast by 12,000 feet is much more common in the cold season while FEW/SCT are more common during the warm season

Groups of cities can be examined, faceted by month:

cityCloudList <- list(c("klas_2016", "ksan_2016", "kiah_2016", "kmsy_2016"), 
                      c("kgrb_2016", "kgrr_2016", "kdtw_2016", "ktvc_2016"), 
                      c("klnk_2016", "kmsp_2016", "kmsn_2016", "kind_2016"), 
                      c("kmke_2016", "kord_2016", "kewr_2016")
                      )

for (x in cityCloudList) {
    cloudUse <- cloudSummary %>%
        filter(source %in% x)
    plotMaxObsc(cloudUse, 
                xVar="monthfct", 
                fillVar="obscType", 
                title="Maximum Cloud Obscuration", 
                facetOn="sourceName", 
                posnBar="fill"
                )
}

A few findings include:

  • While there may be some small seasonal patterns, observations in Las Vegas are almost always CLR/FEW
  • Houston and New Orleans both have few clear observations and few overcasts during June-September
  • San Diego has most of its overcasts during May-September and most of its clear observations from October-April
  • Most of the remaining locales show a seasonal pattern with overcasts more common during the cold season
  • Newark is at a glance less seasonal than the midwestern locales

The ceiling heights can also be assessed:

cloudSummary <- cloudSummary %>%
    mutate(ceilFactor=factor(case_when(ceilingHeight == -100 ~ "None", 
                                       ceilingHeight <= 1000 ~ "0-1000", 
                                       ceilingHeight <= 3000 ~ "1000-3000", 
                                       ceilingHeight <= 6000 ~ "3000-6000",
                                       ceilingHeight <= 12000 ~ "6000-12000"
                                       ), 
                             levels=c("None", "6000-12000", "3000-6000", "1000-3000", "0-1000")
                             )
           )

plotMaxObsc(cloudSummary, 
            xVar="sourceName", 
            fillVar="ceilFactor", 
            title="Ceiling Height", 
            orderByVariable="ceilFactor",
            orderByValue="None"
            )

for (x in cityCloudList) {
    cloudUse <- cloudSummary %>%
        filter(source %in% x)
    plotMaxObsc(cloudUse, 
                xVar="monthfct", 
                fillVar="ceilFactor", 
                title="Ceiling Height (proportion of hourly observations)", 
                facetOn="sourceName", 
                posnBar="fill", 
                yLabel="", 
                legendLabel="Ceiling Height (feet)"
                )
}

Findings for ceiling height broadly line up with findings for maximum cloud obscuration, as expected:

  • Las Vegas is least likely to have a ceiling under 12,000 feet, followed by New Orleans, Lincoln, and Houston
  • San Diego is especially likely to have a ceiling of 1000-3000 feet relative to other locales, likely driven by marine layer

The same analysis can be run for the minimum cloud height:

cloudSummary <- cloudSummary %>%
    mutate(minCFactor=factor(case_when(cloudHeight == -100 ~ "None", 
                                       cloudHeight <= 1000 ~ "0-1000", 
                                       cloudHeight <= 3000 ~ "1000-3000", 
                                       cloudHeight <= 6000 ~ "3000-6000",
                                       cloudHeight <= 12000 ~ "6000-12000"
                                       ), 
                             levels=c("None", "6000-12000", "3000-6000", "1000-3000", "0-1000")
                             )
           )

plotMaxObsc(cloudSummary, 
            xVar="sourceName", 
            fillVar="minCFactor", 
            title="Minimum Cloud Height", 
            orderByVariable="minCFactor",
            orderByValue="None"
            )

for (x in cityCloudList) {
    cloudUse <- cloudSummary %>%
        filter(source %in% x)
    plotMaxObsc(cloudUse, 
                xVar="monthfct", 
                fillVar="minCFactor", 
                title="Minimum Cloud Height (proportion of hourly observations)", 
                facetOn="sourceName", 
                posnBar="fill", 
                yLabel="", 
                legendLabel="Minimum Cloud Height (feet)"
                )
}

At a glance, there seem to be some similarities and differences among the locales:

  • San Diego frequently has its lowest clouds in the 0-3000 feet range, particularly during the warm season
  • Las Vegas almost never has clouds below 6000 feet in any month
  • Houston and New Orleans look very similar in minimum cloud heights by month
  • Grand Rapids, Traverse City, Detroit, Indianapolis, Minneapolis, Chicago, Milwaukee, and Madison look similar in minimum cloud heights
  • Green Bay and Lincoln appear to have more occurrences of clear skies up to 12000 feet than the other cold weather cities
  • Newark seems somewhat close to several different groupings of cities

A more analytical approach would look at the distance between the various locales. Broadly speaking, distance can be calculated based on counts by month by locale. No scaling is performed: each location has roughly the same number of observations by month, and the intent is to find macro similarities. For example, if one city has 2% of its data in bucket x and the others all have 0% of their data in bucket x, this analysis would treat that as a minor difference, whereas with scaled data it would be a primary difference driver.
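
The effect can be sketched with a toy example (hypothetical counts, not the actual weather data): a bucket that every locale fills heavily dominates the unscaled distance, while scaling gives a rarely filled bucket equal weight.

```r
# Toy counts: 'common' is a bucket every locale fills heavily,
# 'rare' is a bucket almost nobody fills
counts <- matrix(c(500, 0,
                   520, 2,
                   510, 1),
                 ncol=2, byrow=TRUE,
                 dimnames=list(c("cityA", "cityB", "cityC"),
                               c("common", "rare")))

dist(counts)         # A-B distance ~20.1, driven almost entirely by 'common'
dist(scale(counts))  # A-B distance ~2.83, both columns contribute equally
```

Unscaled, the cityA-cityB distance is sqrt(20^2 + 2^2); after scaling, both columns reduce to (-1, 1, 0) and contribute identically.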

There are three main variables to consider for distance:

  • obscType - the obscuration type observed
  • minCFactor - minimum cloud height observed, bucketed as 0-1000, 1000-3000, 3000-6000, 6000-12000, None
  • ceilFactor - ceiling height observed, bucketed as 0-1000, 1000-3000, 3000-6000, 6000-12000, None

Distances can be calculated using any or all of these variables. A function is created so that the process can be repeated:

findCloudDist <- function(df, 
                          byVar, 
                          fctVar,
                          pivotVar="monthfct", 
                          scaleDistData=FALSE, 
                          returnPivotOnly=FALSE
                          ) {
    
    # FUNCTION ARGUMENTS
    # df: data frame or tibble containing one record per locale and time
    # byVar: the variable(s) by which the final distance file should be unique
    # fctVar: the factor variables to be counted up, with the sums being the distance inputs
    # pivotVar: the variable(s) which should be pivoted into columns to make the file unique by byVar
    # scaleDistData: whether to scale the data prior to calculating distances
    # returnPivotOnly: if TRUE, return the pivoted data frame itself; if FALSE (default), return the distance matrix calculated from this pivoted data frame
    
    # Create the counts data by byVar, pivoted by fctVar and pivotVar
    baseData <- df %>%
        group_by_at(vars(all_of(c(byVar, fctVar, pivotVar)))) %>%
        summarize(n=n()) %>%
        ungroup() %>%
        pivot_wider(names_from=all_of(c(fctVar, pivotVar)), values_from="n") %>%
        mutate_if(is.numeric, tidyr::replace_na, replace=0)
    
    # If the data are only to be pivoted, return and exit the function
    if (returnPivotOnly) {
        return(baseData)
    }
    
    # Split into descriptors and data
    descData <- baseData %>%
        select_at(vars(all_of(byVar))) %>%
        mutate(rowN=row_number())
    distData <- baseData %>%
        select_at(vars(-any_of(byVar))) %>%
        as.matrix()
    
    # Scale distdata if requested
    if (scaleDistData) {
        distData <- scale(distData)
    }
    
    # Create the distances, convert back to data frame, pivot_longer, and attach the labels
    dist(distData) %>%
        as.matrix() %>%
        as_tibble() %>%
        mutate(row1=row_number()) %>%
        pivot_longer(-row1, names_to="row2", values_to="dist") %>%
        mutate(row2=as.integer(row2)) %>%
        inner_join(descData, by=c("row1"="rowN")) %>%
        rename_at(vars(all_of(byVar)), ~paste0(., "_1")) %>%
        inner_join(descData, by=c("row2"="rowN")) %>%
        rename_at(vars(all_of(byVar)), ~paste0(., "_2"))
    
}

The function is then run for the obscuration data as an example:

# Find the distances for obscType by locale vs. locale
obscDist <- findCloudDist(cloudSummary, 
                          byVar=c("source", "sourceName"), 
                          fctVar="obscType", 
                          pivotVar="monthfct"
                          )
obscDist
## # A tibble: 225 x 7
##     row1  row2  dist source_1  sourceName_1       source_2 sourceName_2         
##    <int> <int> <dbl> <chr>     <chr>              <chr>    <chr>                
##  1     1     1    0  kdtw_2016 Detroit, MI (2016) kdtw_20~ Detroit, MI (2016)   
##  2     1     2  627. kdtw_2016 Detroit, MI (2016) kewr_20~ Newark, NJ (2016)    
##  3     1     3  669. kdtw_2016 Detroit, MI (2016) kgrb_20~ Green Bay, WI (2016) 
##  4     1     4  294. kdtw_2016 Detroit, MI (2016) kgrr_20~ Grand Rapids, MI (20~
##  5     1     5  726. kdtw_2016 Detroit, MI (2016) kiah_20~ Houston, TX (2016)   
##  6     1     6  347. kdtw_2016 Detroit, MI (2016) kind_20~ Indianapolis, IN (20~
##  7     1     7 1421. kdtw_2016 Detroit, MI (2016) klas_20~ Las Vegas, NV (2016) 
##  8     1     8 1092. kdtw_2016 Detroit, MI (2016) klnk_20~ Lincoln, NE (2016)   
##  9     1     9  361. kdtw_2016 Detroit, MI (2016) kmke_20~ Milwaukee, WI (2016) 
## 10     1    10  345. kdtw_2016 Detroit, MI (2016) kmsn_20~ Madison, WI (2016)   
## # ... with 215 more rows

A function can then be created to plot the distances as a heatmap and to return the minimum distance for each locale:

plotCloudDist <- function(df, 
                          var1="sourceName_1", 
                          var2="sourceName_2", 
                          met="dist", 
                          roundDist=0, 
                          title="Distance Between Locales", 
                          subT=""
                          ) {
    
    # FUNCTION ARGUMENTS:
    # df: tibble or data frame containing distance data
    # var1: the variable containing the first locale
    # var2: the variable containing the second locale
    # met: the variable containing the pre-calculated distance between var1 and var2
    # roundDist: the rounding for the distance in the plot
    # title: the plot title
    # subT: the plot subtitle

    # Process the data frame and exclude any occurrences of distance to self
    distData <- df %>%
        select_at(vars(all_of(c(var1, var2, met)))) %>%
        filter(get(var1) != get(var2))
    
    # Get the locales by minimum distance to any other locale
    distHiLo <- distData %>%
        group_by_at(vars(all_of(var1))) %>%
        filter(get(met)==min(get(met))) %>%
        arrange(get(met)) %>%
        pull(var1)
    
    # Create a heatmap of the distances
    distData %>%
        ggplot(aes(x=factor(get(var1), levels=distHiLo), y=factor(get(var2), levels=distHiLo))) + 
        geom_tile(aes(fill=get(met))) + 
        geom_text(aes(label=round(get(met), roundDist)), color="lightblue") +
        labs(x="", y="", title=title, subtitle=subT) + 
        scale_fill_gradient("Distance", low="black", high="white") + 
        theme(axis.text.x=element_text(angle=90))
    
}

The process can then be run on the obscuration data as a check:

plotCloudDist(obscDist, subT="Based on % of obscuration type by month")

FINDINGS FOR OBSCURATION DISTANCE:

  • Las Vegas is far from everything, with its only modestly near neighbor being Lincoln
  • Lincoln is far from everything except for its moderately near neighbors Las Vegas and Green Bay
  • San Diego is far from everything except its moderately near neighbors Houston, New Orleans, and Newark
  • Green Bay is closest to Minneapolis and is moderately close to most of the midwest cities
  • Newark is reasonably close to many cities, being distant only from Traverse City, Green Bay, Lincoln, and Las Vegas
  • Houston and New Orleans are closest to each other, then to Newark
  • Traverse City and Grand Rapids are relatively close to each other
  • The remaining cold weather cities (Milwaukee, Chicago, Madison, Minneapolis, Indianapolis, Detroit) are all relatively close to each other and also to Grand Rapids

This suggests that there is a cold-weather cluster, a hot-and-humid cluster, and then several locales that are more or less standalone but could be grouped loosely with each other.

The functions can then be run on variations of the data:

# Run for minCFactor
findCloudDist(cloudSummary, 
              byVar=c("source", "sourceName"), 
              fctVar="minCFactor", 
              pivotVar="monthfct"
              ) %>%
    plotCloudDist(subT="Based on % in each minimum cloud height bucket by month")

# Run for ceilFactor
findCloudDist(cloudSummary, 
              byVar=c("source", "sourceName"), 
              fctVar="ceilFactor", 
              pivotVar="monthfct"
              ) %>%
    plotCloudDist(subT="Based on % in each ceiling height bucket by month")

Findings by minimum cloud height are similar to findings by obscuration.

Findings by ceiling height show that Lincoln and Newark are close to each other and that San Diego and Las Vegas are each a segment of one. Traverse City and Grand Rapids are close to each other, as are New Orleans and Houston. There continues to be a large, close segment of cold-weather cities.

Of interest is whether a simple kmeans analysis returns the same findings:

set.seed(2006040940)

ceilDistData <- findCloudDist(cloudSummary, 
                              byVar=c("source", "sourceName"), 
                              fctVar="ceilFactor", 
                              pivotVar="monthfct", 
                              returnPivotOnly=TRUE
                              )

tibble::tibble(locale=ceilDistData$sourceName, 
               cluster=kmeans(dist(ceilDistData[3:ncol(ceilDistData)]), centers=5, nstart=1000)$cluster
               ) %>%
    arrange(-cluster)
## # A tibble: 15 x 2
##    locale                   cluster
##    <chr>                      <int>
##  1 Grand Rapids, MI (2016)        5
##  2 Traverse City, MI (2016)       5
##  3 Las Vegas, NV (2016)           4
##  4 Newark, NJ (2016)              3
##  5 Houston, TX (2016)             3
##  6 Lincoln, NE (2016)             3
##  7 New Orleans, LA (2016)         3
##  8 Detroit, MI (2016)             2
##  9 Green Bay, WI (2016)           2
## 10 Indianapolis, IN (2016)        2
## 11 Milwaukee, WI (2016)           2
## 12 Madison, WI (2016)             2
## 13 Minneapolis, MN (2016)         2
## 14 Chicago, IL (2016)             2
## 15 San Diego, CA (2016)           1

The five-segment k-means is consistent - solo segments for San Diego and Las Vegas; a Grand Rapids-Traverse City segment (both are on the downwind side of Lake Michigan and would be expected to see much different clouds than cities with similar temperatures); a cold-weather midwestern segment; and a catch-all segment.

The data can also be assessed using hierarchical clustering:

hclust(dist(ceilDistData[3:ncol(ceilDistData)]), method="complete") %>%
    plot(labels=ceilDistData$sourceName, cex=0.5, main="Hierarchical on Ceiling Height: method=complete")

hclust(dist(ceilDistData[3:ncol(ceilDistData)]), method="single") %>%
    plot(labels=ceilDistData$sourceName, cex=0.5, main="Hierarchical on Ceiling Height: method=single")

Similar conclusions can be drawn from hierarchical clustering based on ceiling height buckets by month:

  • San Diego and Las Vegas are each segments of one
  • Houston and New Orleans tend to associate together
  • Grand Rapids and Traverse City tend to associate together
  • Lincoln and Newark tend to associate together
  • There are possibly two sub-groupings of the other cold weather locales

This is suggestive of either 6 or 7 segments in these data. Does k-means match up?

tibble::tibble(locale=ceilDistData$sourceName, 
               cluster=kmeans(dist(ceilDistData[3:ncol(ceilDistData)]), centers=6, nstart=1000)$cluster
               ) %>%
    arrange(-cluster)
## # A tibble: 15 x 2
##    locale                   cluster
##    <chr>                      <int>
##  1 Las Vegas, NV (2016)           6
##  2 San Diego, CA (2016)           5
##  3 Newark, NJ (2016)              4
##  4 Lincoln, NE (2016)             4
##  5 Houston, TX (2016)             3
##  6 New Orleans, LA (2016)         3
##  7 Grand Rapids, MI (2016)        2
##  8 Traverse City, MI (2016)       2
##  9 Detroit, MI (2016)             1
## 10 Green Bay, WI (2016)           1
## 11 Indianapolis, IN (2016)        1
## 12 Milwaukee, WI (2016)           1
## 13 Madison, WI (2016)             1
## 14 Minneapolis, MN (2016)         1
## 15 Chicago, IL (2016)             1
tibble::tibble(locale=ceilDistData$sourceName, 
               cluster=kmeans(dist(ceilDistData[3:ncol(ceilDistData)]), centers=7, nstart=1000)$cluster
               ) %>%
    arrange(-cluster)
## # A tibble: 15 x 2
##    locale                   cluster
##    <chr>                      <int>
##  1 Houston, TX (2016)             7
##  2 New Orleans, LA (2016)         7
##  3 Detroit, MI (2016)             6
##  4 Green Bay, WI (2016)           6
##  5 Minneapolis, MN (2016)         6
##  6 Newark, NJ (2016)              5
##  7 Lincoln, NE (2016)             5
##  8 Grand Rapids, MI (2016)        4
##  9 Traverse City, MI (2016)       4
## 10 Indianapolis, IN (2016)        3
## 11 Milwaukee, WI (2016)           3
## 12 Madison, WI (2016)             3
## 13 Chicago, IL (2016)             3
## 14 San Diego, CA (2016)           2
## 15 Las Vegas, NV (2016)           1

Very nicely aligned. The addition of the sixth segment splits the catch-all segment (from the previous five-cluster analysis) into a Houston-New Orleans segment and a Lincoln-Newark segment. The addition of the seventh segment splits the cold weather cities into Chicago-Madison-Milwaukee-Indianapolis and Detroit-Green Bay-Minneapolis.

Are any differences observed in trends for minimum cloud height?

heightDistData <- findCloudDist(cloudSummary, 
                                byVar=c("source", "sourceName"), 
                                fctVar="minCFactor", 
                                pivotVar="monthfct", 
                                returnPivotOnly=TRUE
                                )

hclust(dist(heightDistData[3:ncol(heightDistData)]), method="complete") %>%
    plot(labels=heightDistData$sourceName, cex=0.5, 
         main="Hierarchical on Minimum Cloud Height: method=complete"
         )

hclust(dist(heightDistData[3:ncol(heightDistData)]), method="single") %>%
    plot(labels=heightDistData$sourceName, cex=0.5, 
         main="Hierarchical on Minimum Cloud Height: method=single"
         )

There are some differences when using the ‘complete’ linkage method:

  • There is a Green Bay-Lincoln cluster, with Las Vegas merging in relatively soon
  • There are arguably two cold weather clusters (including Grand Rapids and Traverse City), with Newark merging in relatively soon
  • There is a Houston-New Orleans segment, with San Diego merging in relatively soon

When using the ‘single’ linkage method, these patterns become muted:

  • There are still arguably two cold-weather clusters
  • Houston and New Orleans form a cluster
  • San Diego and Las Vegas are their own cluster
  • Lincoln, Newark, and Green Bay all fall somewhere in the middle

We would expect k-means to pull out 1-2 cold-weather clusters and a Houston-New Orleans cluster:

set.seed(2006041003)

tibble::tibble(locale=heightDistData$sourceName, 
               cluster=kmeans(dist(heightDistData[3:ncol(heightDistData)]), centers=5, nstart=1000)$cluster
               ) %>%
    arrange(-cluster)
## # A tibble: 15 x 2
##    locale                   cluster
##    <chr>                      <int>
##  1 Las Vegas, NV (2016)           5
##  2 Lincoln, NE (2016)             5
##  3 Houston, TX (2016)             4
##  4 New Orleans, LA (2016)         4
##  5 San Diego, CA (2016)           3
##  6 Detroit, MI (2016)             2
##  7 Grand Rapids, MI (2016)        2
##  8 Indianapolis, IN (2016)        2
##  9 Milwaukee, WI (2016)           2
## 10 Madison, WI (2016)             2
## 11 Minneapolis, MN (2016)         2
## 12 Chicago, IL (2016)             2
## 13 Traverse City, MI (2016)       2
## 14 Newark, NJ (2016)              1
## 15 Green Bay, WI (2016)           1
tibble::tibble(locale=heightDistData$sourceName, 
               cluster=kmeans(dist(heightDistData[3:ncol(heightDistData)]), centers=6, nstart=1000)$cluster
               ) %>%
    arrange(-cluster)
## # A tibble: 15 x 2
##    locale                   cluster
##    <chr>                      <int>
##  1 Houston, TX (2016)             6
##  2 New Orleans, LA (2016)         6
##  3 Newark, NJ (2016)              5
##  4 Green Bay, WI (2016)           5
##  5 San Diego, CA (2016)           4
##  6 Detroit, MI (2016)             3
##  7 Grand Rapids, MI (2016)        3
##  8 Indianapolis, IN (2016)        3
##  9 Milwaukee, WI (2016)           3
## 10 Madison, WI (2016)             3
## 11 Minneapolis, MN (2016)         3
## 12 Chicago, IL (2016)             3
## 13 Traverse City, MI (2016)       3
## 14 Lincoln, NE (2016)             2
## 15 Las Vegas, NV (2016)           1
tibble::tibble(locale=heightDistData$sourceName, 
               cluster=kmeans(dist(heightDistData[3:ncol(heightDistData)]), centers=7, nstart=1000)$cluster
               ) %>%
    arrange(-cluster)
## # A tibble: 15 x 2
##    locale                   cluster
##    <chr>                      <int>
##  1 Indianapolis, IN (2016)        7
##  2 Milwaukee, WI (2016)           7
##  3 Madison, WI (2016)             7
##  4 Minneapolis, MN (2016)         7
##  5 Chicago, IL (2016)             7
##  6 Lincoln, NE (2016)             6
##  7 Newark, NJ (2016)              5
##  8 Green Bay, WI (2016)           5
##  9 Las Vegas, NV (2016)           4
## 10 Detroit, MI (2016)             3
## 11 Grand Rapids, MI (2016)        3
## 12 Traverse City, MI (2016)       3
## 13 Houston, TX (2016)             2
## 14 New Orleans, LA (2016)         2
## 15 San Diego, CA (2016)           1
tibble::tibble(locale=heightDistData$sourceName, 
               cluster=kmeans(dist(heightDistData[3:ncol(heightDistData)]), centers=8, nstart=1000)$cluster
               ) %>%
    arrange(-cluster)
## # A tibble: 15 x 2
##    locale                   cluster
##    <chr>                      <int>
##  1 San Diego, CA (2016)           8
##  2 Indianapolis, IN (2016)        7
##  3 Milwaukee, WI (2016)           7
##  4 Madison, WI (2016)             7
##  5 Minneapolis, MN (2016)         7
##  6 Chicago, IL (2016)             7
##  7 Houston, TX (2016)             6
##  8 New Orleans, LA (2016)         6
##  9 Las Vegas, NV (2016)           5
## 10 Detroit, MI (2016)             4
## 11 Grand Rapids, MI (2016)        4
## 12 Traverse City, MI (2016)       4
## 13 Newark, NJ (2016)              3
## 14 Green Bay, WI (2016)           2
## 15 Lincoln, NE (2016)             1

With five segments, the groupings are:

  • San Diego
  • Houston and New Orleans
  • Newark and Green Bay
  • Las Vegas and Lincoln
  • Cold weather cities

Adding the sixth cluster splits apart Las Vegas and Lincoln. Adding the seventh cluster cleaves Grand Rapids-Traverse City-Detroit from the other cold weather cities. Adding the eighth cluster splits apart Green Bay and Newark.

Precipitation

METAR data include descriptions of the precipitation occurring at any given time. Two of the most common precipitation forms are rain (RA) and snow (SN). These can occur together, denoted as RASN or SNRA in the METAR.

Further, the precipitation type can be classified using a prefix as light (-), moderate (no prefix), or heavy (+). So, RA would be moderate rain, -SNRA would be a light snow-rain mix, +RA would be heavy rain.
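
As a minimal standalone sketch (not part of the module's helper functions), the intensity prefix and precipitation type can be separated with base R string tools:

```r
# Sample precipitation groups (hypothetical, for illustration only)
codes <- c("RA", "-SNRA", "+RA", "SN")

# Leading '-' means light, leading '+' means heavy, no prefix means moderate
intensity <- ifelse(startsWith(codes, "-"), "light",
                    ifelse(startsWith(codes, "+"), "heavy", "moderate"))

# Stripping the prefix leaves the type(s), e.g. "SNRA" for a snow-rain mix
ptype <- sub("^[-+]", "", codes)

intensity  # "moderate" "light" "heavy" "moderate"
ptype      # "RA" "SNRA" "RA" "SN"
```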

Additionally, the timing of the precipitation event is captured in the remarks using B (begin) and E (end). So, an hourly METAR of RAB20E35B50 would mean rain started at 20 past the hour, ended at 35 past the hour, and began again at 50 past the hour. Since METARs are often taken just before the top of the hour, a four-digit time is used if the event falls in the ‘previous’ hour; for example, RAB1959E36 in the 2053Z METAR.
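
A minimal sketch of this parsing rule, using base R regex tools (the module's own helpers use stringr equivalents):

```r
# Each event is a B or E followed by a 2-digit time (minutes within the
# METAR's hour) or a 4-digit time (carries its own hour, possibly from the
# previous hour)
events <- regmatches("RAB20E35B50", gregexpr("[BE]\\d+", "RAB20E35B50"))[[1]]
events  # "B20" "E35" "B50"

# A four-digit time keeps its own hour
regmatches("RAB1959E36", gregexpr("[BE]\\d+", "RAB1959E36"))[[1]]  # "B1959" "E36"
```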

Broadly speaking, precipitation at a specific point in time can be extracted from the main METAR (preceding RMK), while precipitation intervals can be extracted by parsing the remarks (following RMK). There can be misalignment between these; for example, an interval that suggests rain is occurring at a time when the METAR does not show rain.
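
The split can be sketched as follows (the METAR shown is a made-up sample, not drawn from the processed files):

```r
metar <- "KDTW 011253Z 27010KT 10SM -RA OVC035 02/M01 A2992 RMK AO2 RAB35 SLP134"

# Everything before RMK describes conditions at observation time;
# everything after RMK holds the remarks, including begin/end times
parts   <- strsplit(metar, " RMK ")[[1]]
body    <- parts[1]
remarks <- parts[2]

grepl("(^| )[-+]?RA", body)   # TRUE: rain is occurring at observation time
grepl("RA[BE]\\d+", remarks)  # TRUE: the remarks carry a rain begin time
```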

There are two key functions for the extraction process:

  • suggestBeginEndTimes() - a function that compares the METAR and remarks and suggests precipitation begin times that should be added or excluded, and precipitation end times that should be added or excluded
  • runFullPrecipExtraction() - a function that extracts relevant precipitation information, with options for begin/end times to be added or excluded

Helper functions include:

  • extractPrecipData() - extracts the precipitation begin and end times from a processed METAR file
  • getBeginEndTimeMatrix() - gets the beginning and ending times as a matrix
  • getBeginEndTimeVector() - extracts the beginning or ending times from a column of the matrix
  • extractTime() - extracts the time as a zero-padded number from a named vector
  • createPrecipInterval() - creates the precipitation time intervals, and flags any that are negative or very long
  • createPrecipIntervalPlots() - plots distributions of precipitation events, precipitation by day, and precipitation by month

Example code includes:

# Extract the precipitation begin and end times from a processed METAR file
extractPrecipData <- function(df, 
                              pType, 
                              showRegex=TRUE, 
                              showSTR=FALSE, 
                              showTable=FALSE,
                              dupError=TRUE
                              ) {

    # FUNCTION ARGUMENTS:
    # df: the data frame or tibble
    # pType: the regex code to be used
    # showRegex: boolean, whether to print out the regex code used
    # showSTR: boolean, whether to show the file str()
    # showTable: boolean, whether to show the file counts of start and end times
    # dupError: boolean, if TRUE will throw an error if any duplicates are found; if FALSE just notes it
    
    # Extract the METAR and the date-time from a processed list
    procMET <- df %>% 
        select(origMETAR, dtime)

    # Check whether the remarks flag the desired pType either beginning or ending
    keyPattern <- paste0("(", pType, "[BE]\\d+[0-9BE]*)")
    if (showRegex) { cat("\nRegex search code is:", keyPattern, "\n\n") }
    
    procMET <- procMET %>% 
        mutate(precipData=str_extract(origMETAR, pattern=keyPattern), 
               isPrecip=grepl(paste0(pType, "[BE]"), origMETAR, perl=TRUE), 
               nBegin=pmax(0, str_count(precipData, "B"), na.rm=TRUE), 
               nEnd=pmax(0, str_count(precipData, "E"), na.rm=TRUE), 
               dateUTC=lubridate::date(dtime), 
               hourUTC=lubridate::hour(dtime)
               )
    if (showSTR) {
        str(procMET)
        cat("\n")
    }

    # Check the counts of precipitation beginning and precipitation ending
    if (showTable) {
        procMET %>%
            count(isPrecip, nBegin, nEnd) %>%
            print()
    }

    # Check that the file is unique by time
    dupAns <- procMET %>%
        select(dateUTC, hourUTC) %>%
        duplicated %>%
        any()

    if (dupError) {
        if (dupAns) { stop("Duplicates observed, investigate and fix\n") }
    } else {
        cat("\nAre there any problems with duplicated keys?", dupAns, "\n\n")
    }
    
    procMET
    
}


# Helper function to extract the beginning and ending times using str_extract_all
getBeginEndTimeMatrix <- function(file, 
                                  pullVar="precipData", 
                                  pState="B"
                                  ) {
    
    # FUNCTION ARGUMENTS:
    # file: file containing data
    # pullVar: the variable of interest in file
    # pState: the precipitation state of interest - B (begin) or E (end)
    
    file %>%
        pull(pullVar) %>%
        str_extract_all(paste0(pState, "\\d+"), simplify=TRUE) %>%
        as.data.frame(stringsAsFactors=FALSE)
    
}


getBeginEndTimeVector <- function(timeExtractFile, 
                                  origFullFile, 
                                  extractVar, 
                                  extractSym
                                  ) {
    
    # FUNCTION ARGUMENTS:
    # timeExtractFile: the file containing the extracted begin and end times
    # origFullFile: the file that the extraction was run from (includes date and time)
    # extractVar: the variable name for the column to be extracted from
    # extractSym: whether it is B (begin) or E (end)
    
    timeExtractFile %>%
        cbind(origFullFile[, c("dateUTC", "hourUTC")]) %>%
        mutate(hourChar=str_pad(str_trim(as.character(hourUTC)), width=2, pad="0")) %>%
        apply(1, FUN=extractTime, var=extractVar, sym=extractSym)
    
}


# Convert the rain begins data to the hour and minute associated to the UTC
extractTime <- function(x, 
                        var, 
                        sym="B"
                        ) {
    
    # FUNCTION ARGUMENTS:
    # x: the named vector
    # var: the variable name to extract
    # sym: whether the suppression needs to be on B (begin) or E (end) to convert to a zero-padded number
    
    # Extracting the time is tricky
    # If NA or blank, then the time is NA
    # Replace the leading B or E and there should then be a 2-digit or 4-digit number
    # If 2-digit, then append date and hour in front of it
    # If 4-digit, then append date in front of it
    # Exception: if 4-digit and begins with 23, then need to use the previous day
    
    if (is.na(x[var]) | x[var]=="") {
        utcReturn <- NA
    }
    else {
        utcUse <- str_replace(x[var], sym, "")
        if (str_length(utcUse)==4) {
            utcReturn <- paste0(x["dateUTC"], " ", utcUse)
            # If a 4-digit time starts with 23 and is in the 0Z METAR, it is part of the previous day
            if(str_sub(utcUse, 1, 2)=="23" & as.numeric(x["hourChar"])==0) {
                utcReturn <- paste0(as.Date(x["dateUTC"]) - 1, " ", utcUse)
            }
        } else if (str_length(utcUse)==2) {
            utcReturn <- paste0(x["dateUTC"], " ", x["hourChar"], utcUse)
        } else {
            cat("\nCannot parse data: ", x, "\n", x[var], "\n", var, sym, utcUse)
            stop()
        }
    }
    utcReturn

}


suggestBeginEndTimes <- function(df, 
                                 regMatch, 
                                 nColMax=3, 
                                 printNACols=FALSE
                                 ) {

    # FUNCTION ARGUMENTS
    # df: the data frame to be examined
    # regMatch: the relevant regex extraction code
    # nColMax: the maximum number of columns allowed for precipitation begins and precipitation ends
    # printNACols: boolean, whether to print the NA column sums
    
    # Pull the data and check for the specified precipitation pattern and lags
    # Needs to follow the \\d{6}Z that shows the time and precede the RMK that denotes the remarks
    sugStates <- df %>%
        select(dtime, origMETAR) %>%
        mutate(curPrecip=str_detect(origMETAR, paste0("\\d{6}Z.*", regMatch, ".*RMK")), 
               lagPrecip=lag(curPrecip, 1)
               )

    # Use the analysis data to look for begins and ends flagged in the remarks
    sugBE <- extractPrecipData(sugStates, pType=regMatch)

    # Inner join the data by dtime
    sugBEJoin <- sugBE %>% 
        select(dtime, precipData, chgPrecip=isPrecip, dateUTC, hourUTC)
    sugStates <- sugStates %>%
        inner_join(sugBEJoin, by="dtime")

    # Get the beginning and end times data for the desired precipitation type
    sugBegin <- getBeginEndTimeMatrix(sugStates, pState="B")
    sugEnd <- getBeginEndTimeMatrix(sugStates, pState="E")

    # Ensure that user-defined maximum number of columns is not exceeded
    if (ncol(sugBegin) > nColMax | ncol(sugEnd) > nColMax) { 
        stop("Function argument set to allow only for 0 - ", nColMax, " columns, investigate and fix")
    }
    
    # For each iteration, create testBTn and testETn (use NA if no column)
    for (ctr in 1:nColMax) {
        
        # Create NA as the baseline
        assign(paste0("testBT", ctr), NA)
        assign(paste0("testET", ctr), NA)
        
        # Assign begin if exists
        if (ncol(sugBegin) >= ctr) {
            bt <- getBeginEndTimeVector(sugBegin, sugStates, extractVar=paste0("V", ctr), extractSym="B")
            assign(paste0("testBT", ctr), bt)
        }
        
        # Assign end if exists
        if (ncol(sugEnd) >= ctr) {
            et <- getBeginEndTimeVector(sugEnd, sugStates, extractVar=paste0("V", ctr), extractSym="E")
            assign(paste0("testET", ctr), et)
        }
        
    }
    
    # Integrate to a single file, allowing for nColMax applications of variables
    # Create file sugExceptions as sugStates and initialize the columns for begins and ends
    sugExceptions <- sugStates %>%
        mutate(begins=0, ends=0)
    
    # Loop through up to nColMax, adding the appropriate bn=testBTn and en=testETn
    for (ctr in 1:nColMax) {
        sugExceptions <- sugExceptions %>%
            mutate(!!paste0("b", ctr) := get(paste0("testBT", ctr)), 
                   !!paste0("e", ctr) := get(paste0("testET", ctr)), 
                   begins=begins + ifelse(is.na(get(paste0("b", ctr))), 0, 1),
                   ends=ends + ifelse(is.na(get(paste0("e", ctr))), 0, 1)
                   )
    }
    
    # Create the final version by checking key criteria
    # Add a criteria for b before e if previous; and e before b if not previous?
    sugExceptions <- sugExceptions %>%
        mutate(etob=begins > ends, 
               btoe=ends > begins, 
               needBegin=curPrecip & !lagPrecip & !etob, 
               needEnd=!curPrecip & lagPrecip & !btoe, 
               overBegin=etob & (lagPrecip | !curPrecip), 
               overEnd=btoe & (curPrecip | !lagPrecip)
               )
    
    # Report on the NA status by column
    if (printNACols) {
        colSums(is.na(sugExceptions)) %>% 
            print()
    }

    # Flag potential issues
    cat("\nNeed Begin time\n")
    sugExceptions %>%
        filter(needBegin) %>%
        select(dtime, origMETAR) %>%
        print()

    cat("\nNeed End time\n")
    sugExceptions %>%
        filter(needEnd) %>%
        select(dtime, origMETAR) %>%
        print()

    cat("\nExtraneous Begin time\n")
    sugExceptions %>%
        filter(overBegin) %>%
        select(dtime, paste0("b", 1:nColMax)) %>%
        print()

    cat("\nExtraneous End time\n")
    sugExceptions %>%
        filter(overEnd) %>%
        select(dtime, paste0("e", 1:nColMax)) %>%
        print()

    cat("\nWrong amount of begins or ends\n")
    sugExceptions %>%
        mutate(absMatch=abs(begins-ends), absLag=lag(absMatch, 1), absLead=lead(absMatch, 1)) %>%
        filter(pmax(absMatch, absLag, absLead) > 1) %>%
        select(dtime, paste0("e", 1:nColMax), paste0("b", 1:nColMax)) %>%
        print()
    
    sugExceptions %>%
        filter(needBegin | needEnd | overBegin | overEnd | abs(begins - ends) > 1)
    
}

A function is then created to run the full precipitation extraction:

# Function to create the time intervals data
createPrecipInterval <- function(endVector, 
                                 beginVector, 
                                 endExclude=c(), 
                                 beginExclude=c(), 
                                 sState=FALSE, 
                                 nMinPrint=1, 
                                 maxProb=1000, 
                                 nMaxPrint=1
                                 ) {
    
    # FUNCTION ARGUMENTS:
    # endVector: vector of end times
    # beginVector: vector of begin times
    # endExclude: vector of end times to exclude
    # beginExclude: vector of begin times to exclude
    # sState: the starting precipitation state (TRUE for precipitation, FALSE for no precipitation)
    # nMinPrint: index for the first position of negative intervals to show (typically 1)
    # maxProb: maximum duration (minutes) of precipitation to flag a print out
    # nMaxPrint: index for the first position of very long intervals to show (typically 1)
    
    # If the starting state is one of precipitation, allow it to 'burn in' by deleting the first end time
    if(sState) { endVector <- endVector[2:length(endVector)] }
    
    # Create the interval data
    endsUse <- endVector[!(endVector %in% lubridate::ymd_hm(endExclude))]
    beginsUse <- beginVector[!(beginVector %in% lubridate::ymd_hm(beginExclude))]
    intervalData <- endsUse - beginsUse
    
    # Show a summary of the interval data
    print(summary(as.numeric(intervalData)))
    
    # If there are any non-positive intervals, print the data causing the first of them
    if (min(as.numeric(intervalData)) <= 0) {
        cat("\nProblem Detected - Intervals are not positive.  Data to help investigate\n")
        posns <- which(as.numeric(intervalData) <= 0)
        posns <- max(1, posns[1]-5):min(length(beginsUse), length(endsUse), posns[nMinPrint]+5)
        cat("\nVector of Begins\n")
        print(lubridate::as_datetime(beginsUse[posns]))
        cat("\nVector of Ends\n")
        print(lubridate::as_datetime(endsUse[posns]))
        cat("\n")
    }
    
    # If there are any very long positive intervals, print the data causing the first five of them
    if (max(as.numeric(intervalData)) >= maxProb) {
        cat("\nPotential problem Detected - very long.  Data to help investigate\n")
        posns <- which(as.numeric(intervalData) >= maxProb)
        cat("\nPositions with problems are:", posns)
        posns <- max(1, posns[1]-5):min(length(beginsUse), length(endsUse), posns[min(length(posns), nMaxPrint)]+5)
        cat("\nVector of Begins\n")
        print(lubridate::as_datetime(beginsUse[posns]))
        cat("\nVector of Ends\n")
        print(lubridate::as_datetime(endsUse[posns]))
        cat("\n")
    }
    
    # Return the interval data
    intervalData
    
}
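As a minimal sketch of the arithmetic at the heart of createPrecipInterval() - element-wise subtraction of sorted begin times from sorted end times, assuming the two vectors are equal length and pair off in order - consider the following (toy times, with base R's as.POSIXct standing in for lubridate::ymd_hm so the sketch is self-contained):

```r
# Toy begin/end vectors (hypothetical times); after sorting, each end
# pairs with the begin at the same position
beginsUse <- as.POSIXct(c("2016-03-24 21:51", "2016-04-08 22:46"), tz = "UTC")
endsUse   <- as.POSIXct(c("2016-03-24 22:43", "2016-04-09 00:15"), tz = "UTC")

# Element-wise differences give per-event durations; units are forced to
# minutes so as.numeric() is unambiguous
intervalData <- difftime(endsUse, beginsUse, units = "mins")
as.numeric(intervalData)  # 52 89
```

A negative entry means an end precedes its paired begin, which is exactly the condition the non-positive-interval check above is designed to flag.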



createPrecipIntervalPlots <- function(intervalData, 
                                      beginsData, 
                                      titleText, 
                                      yAxisText, 
                                      beginExclude=c(), 
                                      returnPlotsAndData=FALSE
                                      ) {

    # FUNCTION ARGUMENTS:
    # intervalData: the precipitation interval data
    # beginsData: the precipitation begin times
    # titleText: plot title
    # yAxisText: plot y-axis
    # beginExclude: begin times to exclude
    # returnPlotsAndData: whether to return the plot objects
    
    # Exclude any data from begins as needed
    beginsData <- beginsData[!(beginsData %in% lubridate::ymd_hm(beginExclude))]
    
    # Create a plotting data frame
    histFrame <- data.frame(minutesPrecip=as.numeric(intervalData), 
                            month=lubridate::month(beginsData), 
                            rainDate=lubridate::date(beginsData)
                            ) %>%
        mutate(hoursPrecip=minutesPrecip/60)
    
    # Plot the precipitation durations in hours
    p1 <- histFrame %>%
        ggplot(aes(x=hoursPrecip)) +
        geom_histogram() + 
        labs(title=titleText, x=yAxisText, 
             subtitle="Distribution of hours per unique precipitation event"
             )
    print(p1)

    # Plot the precipitation by day in hours
    p2 <- histFrame %>%
        group_by(rainDate) %>%
        summarize(hoursPrecip=sum(hoursPrecip)) %>%
        ggplot(aes(x=hoursPrecip)) +
        geom_histogram() + 
        labs(title=titleText, x=yAxisText, 
             subtitle="Distribution of hours per day of precipitation (on days when 1+ minutes occurred)"
             )
    print(p2)
    
    # Plot the rain totals (in hours) by month
    # Create a data frame of all months and merge in precipitation data where available (0 otherwise)
    monthFrame <- histFrame %>%
        group_by(month) %>%
        summarize(minutesPrecip=sum(minutesPrecip), hoursPrecip=sum(hoursPrecip), nPrecip=n()) %>%
        right_join(data.frame(month=1:12, monthName=month.abb[1:12]), by="month") %>%
        tidyr::replace_na(list(minutesPrecip=0, hoursPrecip=0, nPrecip=0))
    # print(monthFrame)
    
    p3 <- monthFrame %>%
        ggplot(aes(x=factor(monthName, levels=month.abb[1:12]), y=hoursPrecip)) +
        geom_col() + 
        labs(title=titleText, y=yAxisText, x="") + 
        geom_text(aes(y=2 + hoursPrecip, label=round(hoursPrecip, 1)))
    print(p3)
    
    if (returnPlotsAndData) {
        list(histFrame=histFrame, monthFrame=monthFrame, p1=p1, p2=p2, p3=p3)
    } else {
        NULL
    }
}



# Precipitation regex to name mapper
precipMapper <- c("(?<!BL)SN"="Snowfall", 
                  "(?<!FZ)RA"="Rainfall", 
                  "(?<!VC)TS"="Thunderstorm"
                  )
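The negative lookbehinds in these keys can be spot-checked directly; a small sketch (note that perl = TRUE is required for lookbehind, as in the extraction code):

```r
precipMapper <- c("(?<!BL)SN" = "Snowfall",
                  "(?<!FZ)RA" = "Rainfall",
                  "(?<!VC)TS" = "Thunderstorm")

# Named-vector lookup: the extraction regex doubles as the key
unname(precipMapper["(?<!FZ)RA"])  # "Rainfall"

# The lookbehind keeps blowing snow (BLSN) from matching as snowfall
grepl("(?<!BL)SN", "BLSN", perl = TRUE)  # FALSE
grepl("(?<!BL)SN", "-SN", perl = TRUE)   # TRUE
```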



# Combining all the functions in one place
runFullPrecipExtraction <- function(df, 
                                    pType, 
                                    titleText=NULL, 
                                    yAxisText=NULL,
                                    endExclude=c(), 
                                    beginExclude=c(), 
                                    endAdd=c(),
                                    beginAdd=c(),
                                    maxProb=1000, 
                                    sState=FALSE, 
                                    makePlots=TRUE, 
                                    returnPlotsAndData=FALSE, 
                                    axisMapper=precipMapper,
                                    titleMapper=cityNameMapper, 
                                    nColMax=3, 
                                    frameName=deparse(substitute(df))
                                    ) {
    
    # FUNCTION ARGUMENTS
    # df: the data frame or tibble
    # pType: the regex for the precipitation type of interest
    # titleText: title text for the associated plot
    # yAxisText: y-axis text for the associated plot
    # endExclude: vector of end times to be excluded, formatted as yyyy-mm-dd hhmm (all zero-padded)
    # beginExclude: vector of begin times to be excluded, formatted as yyyy-mm-dd hhmm (all zero-padded)
    # endAdd: vector of end times to be added, formatted as yyyy-mm-dd hhmm (all zero-padded)
    # beginAdd: vector of begin times to be added, formatted as yyyy-mm-dd hhmm (all zero-padded)
    # maxProb: length of precipitation (in minutes) beyond which a note is thrown
    # sState: whether the starting state is with precipitation
    # makePlots: whether to create plots
    # returnPlotsAndData: whether to return the plot object with the other data
    # axisMapper: mapping file to create y-axis names from pType
    # titleMapper: mapping file to create the title names from df
    # nColMax: the maximum allowed number of columns for each of precipitation begin and end times
    
    # Create the yAxisText if it has not been provided
    if (is.null(yAxisText)) {
        yAxisText <- paste0("Hours of ", axisMapper[pType])
    }
    
    # Create the title if it has not been provided
    if (is.null(titleText)) {
        titleText <- paste0(titleMapper[frameName], " - Hours of ", axisMapper[pType])
    }
    
    # Extract the precipitation data from a specified processed METAR file
    testFileProc <- extractPrecipData(df, pType=pType)
    
    # Confirm that the two-column specification is met (should relax hard-coding on this)
    testBegin <- getBeginEndTimeMatrix(testFileProc, pState="B")
    testEnd <- getBeginEndTimeMatrix(testFileProc, pState="E")

    if ((ncol(testBegin) > nColMax) || (ncol(testEnd) > nColMax)) { 
        cat("\nBegin columns:", ncol(testBegin), "\t\tEnd columns:", ncol(testEnd))
        stop("Hard-coded for at most ", nColMax, " columns each of begin/end - fix")
    }

    # Initialize testAllBegins and testAllEnds
    testAllBegins <- beginAdd
    testAllEnds <- endAdd
    
    # For each iteration, append to testAllBegins and testAllEnds as appropriate
    for (ctr in 1:nColMax) {
        
        # Append begin if exists
        if (ncol(testBegin) >= ctr) {
            bt <- getBeginEndTimeVector(testBegin, testFileProc, extractVar=paste0("V", ctr), extractSym="B")
            testAllBegins <- c(testAllBegins, bt[!is.na(bt)])
        }
        
        # Assign end if exists
        if (ncol(testEnd) >= ctr) {
            et <- getBeginEndTimeVector(testEnd, testFileProc, extractVar=paste0("V", ctr), extractSym="E")
            testAllEnds <- c(testAllEnds, et[!is.na(et)])
        }
        
    }
    
    # Convert to date-time and sort
    testAllBegins <- testAllBegins %>%
        lubridate::ymd_hm() %>%
        sort()

    testAllEnds <- testAllEnds %>%
        lubridate::ymd_hm() %>%
        sort()
    
    # Create the intervals
    testIntervals <- createPrecipInterval(testAllEnds, 
                                          testAllBegins, 
                                          endExclude=endExclude, 
                                          beginExclude=beginExclude, 
                                          maxProb=maxProb, 
                                          sState=sState
                                          )
    
    # Create the precipitation plots
    plotOut <- NULL
    if (makePlots) {
        plotOut <- createPrecipIntervalPlots(testIntervals, 
                                             testAllBegins, 
                                             titleText=titleText, 
                                             yAxisText=yAxisText, 
                                             beginExclude=beginExclude, 
                                             returnPlotsAndData=returnPlotsAndData
                                             )
    }
    
    if (!returnPlotsAndData) { plotOut <- NULL }
    
    # Return all of the key files, along with the parameters used
    keyParams <- list(fileName=frameName, 
                      pType=pType, 
                      endExclude=endExclude, 
                      beginExclude=beginExclude, 
                      endAdd=endAdd, 
                      beginAdd=beginAdd,
                      maxProb=maxProb, 
                      sState=sState
                      )
    list(keyParams=keyParams, 
         testFileProc=testFileProc, 
         testAllBegins=testAllBegins[!(testAllBegins %in% lubridate::ymd_hm(beginExclude))], 
         testAllEnds=testAllEnds[!(testAllEnds %in% lubridate::ymd_hm(endExclude))], 
         testIntervals=testIntervals,
         plotOut=plotOut
         )
}

The functions are then run, using the Chicago 2016 data:

# Run for Chicago 2016, extracting rain that is not freezing rain
kord2016ExceptRA <- suggestBeginEndTimes(kord_2016, regMatch="(?<!FZ)RA")
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## 
## Need Begin time
## # A tibble: 2 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-03-24 21:51:00 KORD 242151Z 34014KT 1 3/4SM -RASN BR SCT010 OVC016 01/00~
## 2 2016-08-29 19:51:00 KORD 291951Z 25009KT 1SM R10L/1200VP6000FT +TSRA BKN041CB~
## 
## Need End time
## # A tibble: 1 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-07-13 18:51:00 KORD 131851Z 22009G18KT 10SM FEW027 SCT047 BKN250 28/22 A~
## 
## Extraneous Begin time
## # A tibble: 0 x 4
## # ... with 4 variables: dtime <dttm>, b1 <chr>, b2 <chr>, b3 <lgl>
## 
## Extraneous End time
## # A tibble: 0 x 4
## # ... with 4 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>
## 
## Wrong amount of begins or ends
## # A tibble: 0 x 7
## # ... with 7 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>, b1 <chr>,
## #   b2 <chr>, b3 <lgl>
kord2016ExceptRA
## # A tibble: 3 x 22
##   dtime               origMETAR curPrecip lagPrecip precipData chgPrecip
##   <dttm>              <chr>     <lgl>     <lgl>     <chr>      <lgl>    
## 1 2016-03-24 21:51:00 KORD 242~ TRUE      FALSE     <NA>       FALSE    
## 2 2016-07-13 18:51:00 KORD 131~ FALSE     TRUE      <NA>       FALSE    
## 3 2016-08-29 19:51:00 KORD 291~ TRUE      FALSE     <NA>       FALSE    
## # ... with 16 more variables: dateUTC <date>, hourUTC <int>, begins <dbl>,
## #   ends <dbl>, b1 <chr>, e1 <chr>, b2 <chr>, e2 <chr>, b3 <lgl>, e3 <lgl>,
## #   etob <lgl>, btoe <lgl>, needBegin <lgl>, needEnd <lgl>, overBegin <lgl>,
## #   overEnd <lgl>
# Run extraction for Chicago 2016
# Attempt to use on the kord2016 rain data
kordRain2016Test <- runFullPrecipExtraction(kord_2016, 
                                            pType="(?<!FZ)RA", 
                                            titleText="Chicago, IL Rainfall (hours) in 2016", 
                                            yAxisText="Hours of Rain", 
                                            endExclude=c(),
                                            beginExclude=c(),
                                            endAdd=c("2016-07-13 1851"), 
                                            beginAdd=c("2016-03-24 2151", "2016-08-29 1951"),
                                            maxProb=1440, 
                                            sState=FALSE, 
                                            makePlots=TRUE
                                            )
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   14.00   33.50   70.51   83.50 1243.00
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

A function can then be written to test the intervals for consistency:

intervalConsistency <- function(lst, 
                                pType, 
                                maxPrint=NULL
                                ){

    # FUNCTION ARGUMENTS
    # lst: the output list from runFullPrecipExtraction()
    # pType: precipitation type
    # maxPrint: maximum number of mismatches to print (default is to print all)
    
    require(lubridate)
    
    # Extract the beginning and interval times
    begins <- lst[["testAllBegins"]]
    ends <- lst[["testAllEnds"]]
    durs <- lst[["testIntervals"]]

    # Create intervals from the raw list file
    precipInts <- lubridate::interval(begins, begins + durs - 1)

    
    # Extract the METAR and date-time information
    metar <- lst[["testFileProc"]][["origMETAR"]]
    dtime <- lst[["testFileProc"]][["dtime"]]
    
    # Take each METAR observation and check two factors
    # Is the precipitation type recorded in that METAR?
    # Does that METAR fall in any of the intervals?
    precipMETAR <- grepl(paste0("\\d{6}Z.*", pType, ".*RMK"), metar, perl=TRUE)
    intMETAR <- sapply(dtime, FUN=function(x) {x %within% precipInts %>% any()})

    # Check for the consistency of the observations and print the mismatches
    print(table(precipMETAR, intMETAR))

    mism <- which(precipMETAR != intMETAR)
    if (length(mism) == 0) {
        cat("\nFull matches between METAR observations and intervals\n")
    } else {
        if (is.null(maxPrint)) { maxPrint <- length(mism) }
        for (x in mism[1:min(maxPrint, length(mism))]) {
            cat("\nMismatch at time", strftime(dtime[x], format="%Y-%m-%d %H:%M", tz="UTC"), "UTC\n")
            print(metar[max(1, x-2):min(length(metar), x+2)])
        }
    }
    
    list(precipInts=precipInts, mismatches=mism, mismatchTimes=dtime[mism])
}
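The %within% containment test at the core of intervalConsistency() reduces to a pair of vectorized comparisons; a self-contained base-R sketch (toy times, with interval endpoints assumed inclusive):

```r
# Toy precipitation intervals (hypothetical begin/end pairs)
begins <- as.POSIXct(c("2016-03-24 21:51", "2016-04-08 22:46"), tz = "UTC")
ends   <- as.POSIXct(c("2016-03-24 22:43", "2016-04-09 00:15"), tz = "UTC")

# An observation counts as "in precipitation" if it lies within any interval
inAnyInterval <- function(x) any(x >= begins & x <= ends)

obs <- as.POSIXct(c("2016-03-24 22:00", "2016-03-25 03:00"), tz = "UTC")
sapply(obs, inAnyInterval)  # TRUE FALSE
```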

And, the function can then be run for the Chicago 2016 data:

tmp <- intervalConsistency(kordRain2016Test, pType="(?<!FZ)RA")
## Loading required package: lubridate
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8349    0
##       TRUE      0  466
## 
## Full matches between METAR observations and intervals

So the process successfully extracts the precipitation data; for Chicago 2016, full alignment is achieved by adding only 1 end time and 2 begin times.

An additional pair of helper functions is written to take the exceptions from suggestBeginEndTimes() and to create the endExclude, beginExclude, endAdd, and beginAdd vectors automatically:

# Helper function to create times in the format preferred by the other functions
timeFormatter <- function(timeVec) {
    
    if (length(timeVec) == 0) {
        return(character(0))
    }
    
    paste0(as.Date(timeVec), 
           " ", 
           str_pad(lubridate::hour(timeVec), pad="0", side="left", width=2),
           str_pad(lubridate::minute(timeVec), pad="0", side="left", width=2)
           )
}
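For a single datetime, the same 'yyyy-mm-dd hhmm' layout can be produced with one base-R format() call, which makes the intended output easy to verify (toy time; timeFormatter() builds the string piecewise instead):

```r
x <- as.POSIXct("2016-03-24 21:51", tz = "UTC")
format(x, "%Y-%m-%d %H%M")  # "2016-03-24 2151"
```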

convertExceptionsToTimes <- function(excFile, maxCol=3) {
    
    # Basic beginAdd (just add the METAR time)
    beginAdd <- excFile %>%
        filter(needBegin) %>%
        pull(dtime) %>%
        timeFormatter()
    
    # Basic endAdd (just add the METAR time)
    endAdd <- excFile %>%
        filter(needEnd) %>%
        pull(dtime) %>%
        timeFormatter()
    
    # Basic endExclude (just delete the earliest end time - maybe add a guard check later)
    endExclude <- excFile %>%
        filter(overEnd) %>%
        select_at(vars(all_of(paste0("e", 1:maxCol)))) %>%
        pull(1)

    # Basic beginExclude (just delete the earliest begin time - maybe add a guard check later)
    beginExclude <- excFile %>%
        filter(overBegin) %>%
        select_at(vars(all_of(paste0("b", 1:maxCol)))) %>%
        pull(1)
    
    list(endAdd=endAdd, beginAdd=beginAdd, endExclude=endExclude, beginExclude=beginExclude)
    
}
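In base-R terms, the endExclude step above reduces to subsetting the first end-time column on the overEnd flag; a toy one-row frame (hypothetical values in the document's time format) makes the behavior concrete:

```r
# Toy exceptions frame with only the columns this step touches
excFile <- data.frame(
  dtime   = as.POSIXct("2016-04-09 04:51", tz = "UTC"),
  e1      = "2016-04-09 0446",
  overEnd = TRUE,
  stringsAsFactors = FALSE
)

# Rows flagged overEnd contribute their first end time to the exclusions
endExclude <- excFile$e1[excFile$overEnd]
endExclude  # "2016-04-09 0446"
```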

And the process can be attempted on the 2016 Chicago snow data:

# Run for Chicago 2016, extracting snow that is not blowing snow
kord2016ExceptSN <- suggestBeginEndTimes(kord_2016, regMatch="(?<!BL)SN")
## 
## Regex search code is: ((?<!BL)SN[B|E]\d+[0-9BE]*) 
## 
## 
## Need Begin time
## # A tibble: 4 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-02-10 00:51:00 KORD 100051Z 31012KT 4SM -SN FEW024 OVC040 M09/M13 A2993 ~
## 2 2016-03-24 21:51:00 KORD 242151Z 34014KT 1 3/4SM -RASN BR SCT010 OVC016 01/00~
## 3 2016-04-08 22:51:00 KORD 082251Z 31014G31KT 9SM -SN SCT045 BKN050 OVC080 02/M~
## 4 2016-12-24 01:51:00 KORD 240151Z 17011KT 4SM -SN BR SCT015 OVC039 01/M01 A299~
## 
## Need End time
## # A tibble: 0 x 2
## # ... with 2 variables: dtime <dttm>, origMETAR <chr>
## 
## Extraneous Begin time
## # A tibble: 0 x 4
## # ... with 4 variables: dtime <dttm>, b1 <chr>, b2 <chr>, b3 <lgl>
## 
## Extraneous End time
## # A tibble: 1 x 4
##   dtime               e1              e2    e3   
##   <dttm>              <chr>           <chr> <lgl>
## 1 2016-04-09 04:51:00 2016-04-09 0446 <NA>  NA   
## 
## Wrong amount of begins or ends
## # A tibble: 0 x 7
## # ... with 7 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>, b1 <chr>,
## #   b2 <chr>, b3 <lgl>
kord2016ExceptSNList <- convertExceptionsToTimes(kord2016ExceptSN)

# Attempt to use on the kord2016 snow data
kordSnow2016Test <- runFullPrecipExtraction(kord_2016, 
                                            pType="(?<!BL)SN", 
                                            titleText="Chicago, IL Snowfall (hours) in 2016", 
                                            yAxisText="Hours of Snow", 
                                            endExclude=kord2016ExceptSNList$endExclude,
                                            beginExclude=kord2016ExceptSNList$beginExclude,
                                            endAdd=kord2016ExceptSNList$endAdd, 
                                            beginAdd=kord2016ExceptSNList$beginAdd,
                                            maxProb=1440, 
                                            sState=FALSE, 
                                            makePlots=TRUE
                                            )
## 
## Regex search code is: ((?<!BL)SN[B|E]\d+[0-9BE]*) 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00   20.25   47.50  134.12  140.75 1016.00
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

tmp <- intervalConsistency(kordSnow2016Test, pType="(?<!BL)SN")
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8532    0
##       TRUE      0  283
## 
## Full matches between METAR observations and intervals

While there is still some work to be done (the current function makes assumptions about which columns - e1/b1 - hold the problem times, and it does not yet handle cases where there are, for example, 2 begins with no ends), the function shows promise in automating the handling of precipitation data.

A few edits have been made to the underlying functions:

  • The search string is restricted to precipitation that follows ‘Z’ and precedes ‘RMK’ to avoid the issue of the station code KMSN being treated as snow
  • suggestBeginEndTimes is updated to allow for a user-defined maximum number of columns for precipitation begin and end times per hour (previously hard-coded to 3)
  • runFullPrecipExtraction is updated to allow for a user-defined maximum number of columns for precipitation begin and end times per hour (previously hard-coded to 3) and to create axis label and title if not passed

The functions can then be combined, allowing for everything to be run at once:

combinedPrecipitation <- function(df, 
                                  pType, 
                                  titleText=NULL,
                                  yAxisText=NULL,
                                  maxProb=1440,
                                  sState=FALSE,
                                  makePlots=TRUE, 
                                  axisMapper=precipMapper,
                                  titleMapper=cityNameMapper, 
                                  frameName=NULL, 
                                  maxConsistencyPrint=NULL
                                  ) {
    
    # FUNCTION ARGUMENTS:
    # df: the data frame or tibble containing the METAR and dtime
    # pType: the regex code for extracting the relevant precipitation type
    # titleText: Desired plot title (will be created from name of df if NULL)
    # yAxisText: desired y-axis labels (will be created from regex if NULL)
    # maxProb: flag any precipitation events longer than this number of minutes
    # sState: boolean, is there precipitation in the first record?
    # makePlots: boolean, should plots be created?
    # axisMapper: mapping file for creating plot axis name from precipitation type
    # titleMapper: mapping file for creating plot title from data frame name
    # frameName: name of the data frame (kluge because deparse(substitute(df)) returns "df" if this function is called by another function or in a for loop)
    # maxConsistencyPrint: how many mismatches should be printed (default, NULL, prints all)
    
    # Create titleText and yAxisText if they are not already created
    if (is.null(yAxisText)) {
        yAxisText <- paste0("Hours of ", axisMapper[pType])
    }

    # Create the frame name if it has not been passed
    if (is.null(frameName)) {
        frameName <- deparse(substitute(df))
    }
    
    # Create the title if it has not been provided
    if (is.null(titleText)) {
        titleText <- paste0(titleMapper[frameName], " - Hours of ", axisMapper[pType])
    }    
    
    # Get the suggested edits to the begin and end times, then convert to a usable list
    excSuggest <- suggestBeginEndTimes(df, regMatch=pType)    
    excList <- convertExceptionsToTimes(excSuggest)
    
    # Run the full precipitation extraction
    precipData <- runFullPrecipExtraction(df, 
                                          pType=pType, 
                                          titleText=titleText,
                                          yAxisText=yAxisText,
                                          endExclude=excList$endExclude,
                                          beginExclude=excList$beginExclude,
                                          endAdd=excList$endAdd, 
                                          beginAdd=excList$beginAdd,
                                          maxProb=maxProb, 
                                          sState=sState, 
                                          makePlots=makePlots,
                                          frameName=frameName
                                          )

    # Check for consistency    
    precipIntervals <- intervalConsistency(precipData, pType=pType, maxPrint=maxConsistencyPrint)

    # For the time being, return the most important data (the precipData list)
    precipData
    
}

The function can then be tested on a few files:

# Run for Chicago 2016 for snow, rain, and thunderstorm
kord2016SN <- combinedPrecipitation(kord_2016, pType="(?<!BL)SN")
## 
## Regex search code is: ((?<!BL)SN[B|E]\d+[0-9BE]*) 
## 
## 
## Need Begin time
## # A tibble: 4 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-02-10 00:51:00 KORD 100051Z 31012KT 4SM -SN FEW024 OVC040 M09/M13 A2993 ~
## 2 2016-03-24 21:51:00 KORD 242151Z 34014KT 1 3/4SM -RASN BR SCT010 OVC016 01/00~
## 3 2016-04-08 22:51:00 KORD 082251Z 31014G31KT 9SM -SN SCT045 BKN050 OVC080 02/M~
## 4 2016-12-24 01:51:00 KORD 240151Z 17011KT 4SM -SN BR SCT015 OVC039 01/M01 A299~
## 
## Need End time
## # A tibble: 0 x 2
## # ... with 2 variables: dtime <dttm>, origMETAR <chr>
## 
## Extraneous Begin time
## # A tibble: 0 x 4
## # ... with 4 variables: dtime <dttm>, b1 <chr>, b2 <chr>, b3 <lgl>
## 
## Extraneous End time
## # A tibble: 1 x 4
##   dtime               e1              e2    e3   
##   <dttm>              <chr>           <chr> <lgl>
## 1 2016-04-09 04:51:00 2016-04-09 0446 <NA>  NA   
## 
## Wrong amount of begins or ends
## # A tibble: 0 x 7
## # ... with 7 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>, b1 <chr>,
## #   b2 <chr>, b3 <lgl>
## 
## Regex search code is: ((?<!BL)SN[B|E]\d+[0-9BE]*) 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00   20.25   47.50  134.12  140.75 1016.00
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8532    0
##       TRUE      0  283
## 
## Full matches between METAR observations and intervals
kord2016RA <- combinedPrecipitation(kord_2016, pType="(?<!FZ)RA")
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## 
## Need Begin time
## # A tibble: 2 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-03-24 21:51:00 KORD 242151Z 34014KT 1 3/4SM -RASN BR SCT010 OVC016 01/00~
## 2 2016-08-29 19:51:00 KORD 291951Z 25009KT 1SM R10L/1200VP6000FT +TSRA BKN041CB~
## 
## Need End time
## # A tibble: 1 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-07-13 18:51:00 KORD 131851Z 22009G18KT 10SM FEW027 SCT047 BKN250 28/22 A~
## 
## Extraneous Begin time
## # A tibble: 0 x 4
## # ... with 4 variables: dtime <dttm>, b1 <chr>, b2 <chr>, b3 <lgl>
## 
## Extraneous End time
## # A tibble: 0 x 4
## # ... with 4 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>
## 
## Wrong amount of begins or ends
## # A tibble: 0 x 7
## # ... with 7 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>, b1 <chr>,
## #   b2 <chr>, b3 <lgl>
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   14.00   33.50   70.51   83.50 1243.00
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8349    0
##       TRUE      0  466
## 
## Full matches between METAR observations and intervals
kord2016TS <- combinedPrecipitation(kord_2016, pType="(?<!VC)TS")
## 
## Regex search code is: ((?<!VC)TS[B|E]\d+[0-9BE]*) 
## 
## 
## Need Begin time
## # A tibble: 1 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-10-12 17:51:00 KORD 121751Z 23008G19KT 3SM R10L/5000VP6000FT -TSRA BR FE~
## 
## Need End time
## # A tibble: 0 x 2
## # ... with 2 variables: dtime <dttm>, origMETAR <chr>
## 
## Extraneous Begin time
## # A tibble: 2 x 4
##   dtime               b1              b2    b3   
##   <dttm>              <chr>           <chr> <lgl>
## 1 2016-08-28 08:51:00 2016-08-28 0825 <NA>  NA   
## 2 2016-09-21 17:51:00 2016-09-21 1744 <NA>  NA   
## 
## Extraneous End time
## # A tibble: 3 x 4
##   dtime               e1              e2    e3   
##   <dttm>              <chr>           <chr> <lgl>
## 1 2016-08-28 09:51:00 2016-08-28 0910 <NA>  NA   
## 2 2016-09-21 18:51:00 2016-09-21 1759 <NA>  NA   
## 3 2016-10-06 21:51:00 2016-10-06 2131 <NA>  NA   
## 
## Wrong amount of begins or ends
## # A tibble: 0 x 7
## # ... with 7 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>, b1 <chr>,
## #   b2 <chr>, b3 <lgl>
## 
## Regex search code is: ((?<!VC)TS[B|E]\d+[0-9BE]*) 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -3.00   23.50   43.50   62.66   93.25  266.00 
## 
## Problem Detected - Intervals are not positive.  Data to help investigate
## 
## Vector of Begins
##  [1] "2016-09-28 23:19:00 UTC" "2016-10-01 16:02:00 UTC"
##  [3] "2016-10-06 10:29:00 UTC" "2016-10-06 23:56:00 UTC"
##  [5] "2016-10-12 12:16:00 UTC" "2016-10-12 17:36:00 UTC"
##  [7] "2016-10-12 17:51:00 UTC" "2016-10-16 09:57:00 UTC"
##  [9] "2016-10-26 14:48:00 UTC" "2016-10-30 07:56:00 UTC"
## [11] "2016-11-02 21:06:00 UTC"
## 
## Vector of Ends
##  [1] "2016-09-29 00:12:00 UTC" "2016-10-01 16:21:00 UTC"
##  [3] "2016-10-06 11:04:00 UTC" "2016-10-07 00:25:00 UTC"
##  [5] "2016-10-12 12:41:00 UTC" "2016-10-12 17:33:00 UTC"
##  [7] "2016-10-12 18:05:00 UTC" "2016-10-16 11:20:00 UTC"
##  [9] "2016-10-26 15:28:00 UTC" "2016-10-30 09:37:00 UTC"
## [11] "2016-11-02 21:21:00 UTC"
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8749    0
##       TRUE      0   66
## 
## Full matches between METAR observations and intervals

The thunderstorm data contains an interesting consistency error:

kord_2016 %>% 
    filter(monthint==10, day==12, lubridate::hour(dtime) %in% c(14, 15, 16, 17, 18, 19)) %>%
    pull(origMETAR)
## [1] "KORD 121451Z 21013G21KT 10SM FEW050 SCT110 BKN140 OVC250 18/14 A2993 RMK AO2 SLP133 CB DSNT SW TCU DSNT S 60000 T01830144 56005"                                                                                     
## [2] "KORD 121551Z 23011G19KT 10SM FEW021 SCT050CB SCT120 BKN160 BKN250 20/15 A2992 RMK AO2 LTG DSNT SW SLP130 CB S MOV NE CB DSNT SW T02000150"                                                                           
## [3] "KORD 121651Z 24012G18KT 8SM SCT015 OVC020 18/16 A2995 RMK AO2 LTG DSNT SE RAB1554E33 SLP139 P0001 T01780156"                                                                                                         
## [4] "KORD 121751Z 23008G19KT 3SM R10L/5000VP6000FT -TSRA BR FEW014 BKN060CB OVC095 17/16 A2993 RMK AO2 LTG DSNT NE AND SE AND SW RAB1652 TSE33B36 SLP133 OCNL LTGCG E TS E MOV NE P0006 60007 T01720161 10206 20161 50003"
## [5] "KORD 121851Z 23008KT 6SM -RA BR BKN012 BKN095 OVC120 18/16 A2990 RMK AO2 TSE05 SLP123 P0002 T01780161"                                                                                                               
## [6] "KORD 121951Z 26013KT 5SM -RA BR BKN015 BKN019 OVC095 18/16 A2990 RMK AO2 RAE1857B20 SLP124 P0000 T01780156"
  • The 1651Z METAR contains no thunderstorm information
  • The 1751Z METAR contains thunderstorm information, including -TSRA and TSE33B36
  • The 1851Z METAR reports no active thunderstorm but includes TSE05

As currently written, the algorithm creates a negative three-minute interval (from 1736 to 1733) and then also creates a start time of 1751 to match the active -TSRA and the TSE05 that follows.

In practice, it is more likely that the TSE33 should be deleted, leaving a short interval from TSB36 (1736) through TSE05 (1805) that overlaps the -TSRA observation at 1751.
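The begin/end groups discussed above can be tokenized into individual events before any interval logic is applied. A minimal base-R sketch (the function name is illustrative, not from the module):

```r
# Tokenize a remarks begin/end group such as "TSE33B36" into events.
# Each token is a B (begin) or E (end) followed by a 2- or 4-digit time.
parsePrecipEvents <- function(remark) {
    tokens <- regmatches(remark, gregexpr("[BE]\\d{2,4}", remark))[[1]]
    data.frame(event = substr(tokens, 1, 1),
               time  = substring(tokens, 2),
               stringsAsFactors = FALSE)
}

parsePrecipEvents("TSE33B36")
##   event time
## 1     E   33
## 2     B   36
```

For the 1751Z observation this makes the problem explicit: the group contains an end (E33) before a begin (B36), so taken literally it closes an interval that opened only three minutes later.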

The next steps are to work through this additional consistency check and to test the functions on other locations.

An updated suggestBeginEndTimes() function is created to better manage the situation above. Broadly speaking, there are three types of issues with precipitation data (assuming a focus on a single precipitation type, such as rain):

  • Current METAR precipitation state (yes/no) is the same as previous state, but there is a non-zero net of begin and end in the remarks
  • Current METAR has precipitation, previous METAR does not, and there is not a net +1 of begin/end
  • Current METAR has no precipitation, previous METAR does, and there is not a net -1 of begin/end

There are two ancillary precipitation issues that can be observed:

  • Net changes can be correct in total but imply negative or overlapping intervals - for example, -RA at 1651 and -RA at 1751 with a remark of RAB18E21: the net change is zero, which is consistent, but rain that is already falling cannot begin, and if the rain had actually ended at 21 it would not still be occurring at 51
  • Net changes can be of magnitude greater than one - for example, no rain at 1151, rain at 1251, and a remark of RAB11E26B45B50: the net change of +2 conflicts with the state change, which implies it should be +1
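The net-change comparison used in both examples can be sketched in base R (the function name is illustrative): count B and E markers that precede digits and take the difference.

```r
# Net begin/end change implied by a remarks group: +1 per begin, -1 per end.
# The lookahead (?=\d) ensures only B/E markers followed by a time are counted,
# so letters inside the precipitation code itself are ignored.
netBeginEnd <- function(remark) {
    tokens <- regmatches(remark, gregexpr("[BE](?=\\d)", remark, perl = TRUE))[[1]]
    sum(tokens == "B") - sum(tokens == "E")
}

netBeginEnd("RAB18E21")       # net 0, consistent with no state change
## [1] 0
netBeginEnd("RAB11E26B45B50") # net +2, flagged because the state change implies +1
## [1] 2
```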

The primary issues will be addressed first in an updated function:

# Helper function to print out the potential issue
printBeginEndIssues <- function(df, nColMax=3) {
    
    # FUNCTION ARGUMENTS:
    # df: the data frame or tibble containing the issues
    # nColMax: the maximum number of columns allowed for precipitation begins and precipitation ends
    # Variable names in the file are assumed to have a specific format as hard-coded in the function
    
    cat("\nNeed Begin time\n")
    df %>%
        filter(needBegin) %>%
        select(dtime, origMETAR) %>%
        print()
    
    cat("\nNeed End time\n")
    df %>%
        filter(needEnd) %>%
        select(dtime, origMETAR) %>%
        print()
    
    cat("\nExtraneous Begin time\n")
    df %>%
        filter(overBegin) %>%
        select(dtime, all_of(paste0("b", 1:nColMax))) %>%
        print()
    
    cat("\nExtraneous End time\n")
    df %>%
        filter(overEnd) %>%
        select(dtime, all_of(paste0("e", 1:nColMax))) %>%
        print()
    
    cat("\nWrong amount of begins or ends\n")
    df %>%
        filter(tooManyBE) %>%
        select(dtime, all_of(paste0("e", 1:nColMax)), all_of(paste0("b", 1:nColMax))) %>%
        print()
    
}


# Update suggestBeginEndTimes
suggestBeginEndTimes <- function(df, 
                                 regMatch, 
                                 nColMax=3, 
                                 printNACols=FALSE, 
                                 printIssues=TRUE
                                 ) {
    
    # FUNCTION ARGUMENTS
    # df: the data frame to be examined
    # regMatch: the relevant regex extraction code
    # nColMax: the maximum number of columns allowed for precipitation begins and precipitation ends
    # printNACols: boolean, whether to print the NA column sums
    # printIssues: boolean, whether to print out all of the issues
    
    # Pull the data and check for the specified precipitation pattern and lags
    # Needs to follow the \\d{6}Z that shows the time and precede the RMK that denotes the remarks
    sugStates <- df %>%
        select(dtime, origMETAR) %>%
        mutate(curPrecip=str_detect(origMETAR, paste0("\\d{6}Z.*", regMatch, ".*RMK")), 
               lagPrecip=lag(curPrecip, 1)
        )
    
    # Use the analysis data to look for begins and ends flagged in the remarks
    sugBE <- extractPrecipData(sugStates, pType=regMatch)
    
    # Inner join the data by dtime
    sugBEJoin <- sugBE %>% 
        select(dtime, precipData, chgPrecip=isPrecip, dateUTC, hourUTC)
    sugStates <- sugStates %>%
        inner_join(sugBEJoin, by="dtime")
    
    # Get the beginning and end times data for the desired precipitation type
    sugBegin <- getBeginEndTimeMatrix(sugStates, pState="B")
    sugEnd <- getBeginEndTimeMatrix(sugStates, pState="E")
    
    # Ensure that user-defined maximum number of columns is not exceeded
    if (ncol(sugBegin) > nColMax || ncol(sugEnd) > nColMax) { 
        stop("Function argument set to allow only 0-", nColMax, " columns; investigate and fix")
    }
    
    # For each iteration, create testBTn and testETn (use NA if no column)
    for (ctr in 1:nColMax) {
        
        # Create NA as the baseline
        assign(paste0("testBT", ctr), NA)
        assign(paste0("testET", ctr), NA)
        
        # Assign begin if exists
        if (ncol(sugBegin) >= ctr) {
            bt <- getBeginEndTimeVector(sugBegin, sugStates, extractVar=paste0("V", ctr), extractSym="B")
            assign(paste0("testBT", ctr), bt)
        }
        
        # Assign end if exists
        if (ncol(sugEnd) >= ctr) {
            et <- getBeginEndTimeVector(sugEnd, sugStates, extractVar=paste0("V", ctr), extractSym="E")
            assign(paste0("testET", ctr), et)
        }
        
    }
    
    # Integrate to a single file, allowing for nColMax applications of variables
    # Create file sugExceptions as sugStates and initialize the columns for begins and ends
    sugExceptions <- sugStates %>%
        mutate(begins=0, ends=0)
    
    # Loop through up to nColMax, adding the appropriate bn=testBTn and en=testETn
    for (ctr in 1:nColMax) {
        sugExceptions <- sugExceptions %>%
            mutate(!!paste0("b", ctr) := get(paste0("testBT", ctr)), 
                   !!paste0("e", ctr) := get(paste0("testET", ctr)), 
                   begins=begins + ifelse(is.na(get(paste0("b", ctr))), 0, 1),
                   ends=ends + ifelse(is.na(get(paste0("e", ctr))), 0, 1)
            )
    }
    
    # Create the final version by checking key criteria
    # Possible future criterion: b before e if previously precipitating; e before b if not?
    # Potential issues include:
    # Current and lagged state are the same when begins != ends
    # Current is no, lagged is yes, ends <= begins
    # Current is yes, lagged is no, begins <= ends
    sugExceptions <- sugExceptions %>%
        mutate(issueA=(lagPrecip==curPrecip & begins!=ends), # no state change should have begins==ends
               issueB=(lagPrecip & !curPrecip & ends <= begins),  # change to 'no' should have ends>begins
               issueC=(curPrecip & !lagPrecip & begins <= ends),  # change to 'yes' should have begins>ends
               issueD=abs(begins-ends) > 1,  # state is a 1/0 so net change should be -1, 0, or 1
               needBegin=issueC & begins==0,
               needEnd=issueB & ends==0,
               overBegin=(issueA & (begins > ends)) | (issueB & (ends > 0)),
               overEnd=(issueA & (ends > begins)) | (issueC & (begins > 0)),
               tooManyBE=issueD
        )
    
    # Report on the NA status by column
    if (printNACols) {
        colSums(is.na(sugExceptions)) %>% 
            print()
    }

    # Print out the issue if requested
    if (printIssues) {
        printBeginEndIssues(sugExceptions, nColMax=nColMax)
    }
    
    sugExceptions %>%
        filter(issueA | issueB | issueC | issueD)
    
}
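The dynamic-column loop in suggestBeginEndTimes() relies on tidy-eval (`!!paste0("b", ctr) :=` with `get()`). For reference, the same pattern can be written in base R; the sketch below uses toy data and illustrative variable names:

```r
# Base-R equivalent of the dynamic-column loop: create bN/eN columns with
# computed names and tally the non-NA begins and ends per row.
df <- data.frame(dtime = 1:3)
testBT1 <- c("0015", NA, "0230")  # toy begin times
testET1 <- c(NA, "0120", "0305")  # toy end times

df$begins <- 0
df$ends <- 0
for (ctr in 1:1) {
    df[[paste0("b", ctr)]] <- get(paste0("testBT", ctr))
    df[[paste0("e", ctr)]] <- get(paste0("testET", ctr))
    df$begins <- df$begins + ifelse(is.na(df[[paste0("b", ctr)]]), 0, 1)
    df$ends   <- df$ends   + ifelse(is.na(df[[paste0("e", ctr)]]), 0, 1)
}

df$begins
## [1] 1 0 1
```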

The function can then be run on the Chicago 2016 snow data and compared to the previous results:

# Run for the Chicago snow data
tmp <- suggestBeginEndTimes(kord_2016, regMatch="(?<!BL)SN")
## 
## Regex search code is: ((?<!BL)SN[B|E]\d+[0-9BE]*) 
## 
## 
## Need Begin time
## # A tibble: 4 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-02-10 00:51:00 KORD 100051Z 31012KT 4SM -SN FEW024 OVC040 M09/M13 A2993 ~
## 2 2016-03-24 21:51:00 KORD 242151Z 34014KT 1 3/4SM -RASN BR SCT010 OVC016 01/00~
## 3 2016-04-08 22:51:00 KORD 082251Z 31014G31KT 9SM -SN SCT045 BKN050 OVC080 02/M~
## 4 2016-12-24 01:51:00 KORD 240151Z 17011KT 4SM -SN BR SCT015 OVC039 01/M01 A299~
## 
## Need End time
## # A tibble: 0 x 2
## # ... with 2 variables: dtime <dttm>, origMETAR <chr>
## 
## Extraneous Begin time
## # A tibble: 0 x 4
## # ... with 4 variables: dtime <dttm>, b1 <chr>, b2 <chr>, b3 <lgl>
## 
## Extraneous End time
## # A tibble: 1 x 4
##   dtime               e1              e2    e3   
##   <dttm>              <chr>           <chr> <lgl>
## 1 2016-04-09 04:51:00 2016-04-09 0446 <NA>  NA   
## 
## Wrong amount of begins or ends
## # A tibble: 0 x 7
## # ... with 7 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>, b1 <chr>,
## #   b2 <chr>, b3 <lgl>
# Compare to old Chicago snow data
namesAll <- intersect(names(kord2016ExceptSN), names(tmp))
cat("\nVariables deleted:", setdiff(names(kord2016ExceptSN), namesAll), "\n")
## 
## Variables deleted: etob btoe
cat("\nVariables created:", setdiff(names(tmp), namesAll), "\n")
## 
## Variables created: issueA issueB issueC issueD tooManyBE
identical(kord2016ExceptSN[, namesAll], tmp[, namesAll])
## [1] TRUE

So, for the Chicago 2016 snow data, the new function pulls the same information, just with somewhat different variables (keeping an issue log as variables may prove helpful later).

The function is also re-run on the Chicago thunderstorm data to verify that the negative interval is correctly flagged:

# Run for the Chicago thunderstorm data
tmp <- suggestBeginEndTimes(kord_2016, regMatch="(?<!VC)TS") 
## 
## Regex search code is: ((?<!VC)TS[B|E]\d+[0-9BE]*) 
## 
## 
## Need Begin time
## # A tibble: 0 x 2
## # ... with 2 variables: dtime <dttm>, origMETAR <chr>
## 
## Need End time
## # A tibble: 0 x 2
## # ... with 2 variables: dtime <dttm>, origMETAR <chr>
## 
## Extraneous Begin time
## # A tibble: 2 x 4
##   dtime               b1              b2    b3   
##   <dttm>              <chr>           <chr> <lgl>
## 1 2016-08-28 08:51:00 2016-08-28 0825 <NA>  NA   
## 2 2016-09-21 17:51:00 2016-09-21 1744 <NA>  NA   
## 
## Extraneous End time
## # A tibble: 4 x 4
##   dtime               e1              e2    e3   
##   <dttm>              <chr>           <chr> <lgl>
## 1 2016-08-28 09:51:00 2016-08-28 0910 <NA>  NA   
## 2 2016-09-21 18:51:00 2016-09-21 1759 <NA>  NA   
## 3 2016-10-06 21:51:00 2016-10-06 2131 <NA>  NA   
## 4 2016-10-12 17:51:00 2016-10-12 1733 <NA>  NA   
## 
## Wrong amount of begins or ends
## # A tibble: 0 x 7
## # ... with 7 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>, b1 <chr>,
## #   b2 <chr>, b3 <lgl>
# Show the data
tmp %>%
    select(dtime, curPrecip, lagPrecip, precipData, needBegin, needEnd, overBegin, overEnd, tooManyBE)
## # A tibble: 6 x 9
##   dtime               curPrecip lagPrecip precipData needBegin needEnd overBegin
##   <dttm>              <lgl>     <lgl>     <chr>      <lgl>     <lgl>   <lgl>    
## 1 2016-08-28 08:51:00 FALSE     FALSE     TSB25      FALSE     FALSE   TRUE     
## 2 2016-08-28 09:51:00 FALSE     FALSE     TSE10      FALSE     FALSE   FALSE    
## 3 2016-09-21 17:51:00 FALSE     FALSE     TSB44      FALSE     FALSE   TRUE     
## 4 2016-09-21 18:51:00 FALSE     FALSE     TSE1759    FALSE     FALSE   FALSE    
## 5 2016-10-06 21:51:00 FALSE     FALSE     TSE31      FALSE     FALSE   FALSE    
## 6 2016-10-12 17:51:00 TRUE      FALSE     TSE33B36   FALSE     FALSE   FALSE    
## # ... with 2 more variables: overEnd <lgl>, tooManyBE <lgl>
# Run through the conversion to begin and end times
tmp %>%
    convertExceptionsToTimes()
## $endAdd
## character(0)
## 
## $beginAdd
## character(0)
## 
## $endExclude
## [1] "2016-08-28 0910" "2016-09-21 1759" "2016-10-06 2131" "2016-10-12 1733"
## 
## $beginExclude
## [1] "2016-08-28 0825" "2016-09-21 1744"

It appears that this change will correct the issue on 2016-10-12 with TSE33B36.

The process can then be run for the rain data in all of the files, with outputs saved to a single master list (initial exploration showed some issues with Minneapolis and New Orleans, so these are excluded to start):

# Create the list of files to run on - everything in 2016
fileNames <- grep(pattern="2016", names(cityNameMapper), value=TRUE)
fileNames <- fileNames[!(fileNames %in% c("kmsp_2016", "kmsy_2016"))]
print(fileNames)
##  [1] "kdtw_2016" "kewr_2016" "kgrb_2016" "kgrr_2016" "kiah_2016" "kind_2016"
##  [7] "klas_2016" "klnk_2016" "kmke_2016" "kmsn_2016" "kord_2016" "ksan_2016"
## [13] "ktvc_2016"
# Create a main storage list
listRA <- vector("list", length(fileNames))
names(listRA) <- fileNames

for (fName in fileNames) {

    listRA[[fName]] <- combinedPrecipitation(get(fName), pType="(?<!FZ)RA", frameName=fName)

}
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## 
## Need Begin time
## # A tibble: 2 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-02-24 11:53:00 KDTW 241153Z 04014G19KT 5SM -RAPL SCT008 OVC012 01/M02 A2~
## 2 2016-04-07 23:53:00 KDTW 072353Z 30007KT 7SM -SNRA BKN022 OVC041 01/M01 A2962~
## 
## Need End time
## # A tibble: 2 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-04-07 22:53:00 KDTW 072253Z 30013G19KT 1 1/2SM R03R/6000VP6000FT -SN BKN~
## 2 2016-08-16 14:53:00 KDTW 161453Z 22019G22 10SM BKN014 OVC030 24/22 A2993 RMK ~
## 
## Extraneous Begin time
## # A tibble: 2 x 4
##   dtime               b1              b2    b3   
##   <dttm>              <chr>           <chr> <lgl>
## 1 2016-04-01 00:53:00 2016-04-01 0001 <NA>  NA   
## 2 2016-08-15 00:53:00 2016-08-14 2355 <NA>  NA   
## 
## Extraneous End time
## # A tibble: 1 x 4
##   dtime               e1              e2    e3   
##   <dttm>              <chr>           <chr> <lgl>
## 1 2016-07-14 00:53:00 2016-07-14 0030 <NA>  NA   
## 
## Wrong amount of begins or ends
## # A tibble: 0 x 7
## # ... with 7 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>, b1 <chr>,
## #   b2 <chr>, b3 <lgl>
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   15.00   35.00   86.58   89.50  932.00
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8258    0
##       TRUE      0  560
## 
## Full matches between METAR observations and intervals
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## 
## Need Begin time
## # A tibble: 0 x 2
## # ... with 2 variables: dtime <dttm>, origMETAR <chr>
## 
## Need End time
## # A tibble: 0 x 2
## # ... with 2 variables: dtime <dttm>, origMETAR <chr>
## 
## Extraneous Begin time
## # A tibble: 1 x 4
##   dtime               b1              b2    b3   
##   <dttm>              <chr>           <chr> <lgl>
## 1 2016-02-24 19:51:00 2016-02-24 1858 <NA>  NA   
## 
## Extraneous End time
## # A tibble: 1 x 4
##   dtime               e1              e2    e3   
##   <dttm>              <chr>           <chr> <lgl>
## 1 2016-11-20 08:51:00 2016-11-20 0758 <NA>  NA   
## 
## Wrong amount of begins or ends
## # A tibble: 0 x 7
## # ... with 7 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>, b1 <chr>,
## #   b2 <chr>, b3 <lgl>
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     3.0    17.0    41.0    97.3   104.0   858.0
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8195    0
##       TRUE      0  626
## 
## Full matches between METAR observations and intervals
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## 
## Need Begin time
## # A tibble: 0 x 2
## # ... with 2 variables: dtime <dttm>, origMETAR <chr>
## 
## Need End time
## # A tibble: 0 x 2
## # ... with 2 variables: dtime <dttm>, origMETAR <chr>
## 
## Extraneous Begin time
## # A tibble: 1 x 4
##   dtime               b1              b2    b3   
##   <dttm>              <chr>           <chr> <lgl>
## 1 2016-05-02 17:53:00 2016-05-02 1739 <NA>  NA   
## 
## Extraneous End time
## # A tibble: 0 x 4
## # ... with 4 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>
## 
## Wrong amount of begins or ends
## # A tibble: 0 x 7
## # ... with 7 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>, b1 <chr>,
## #   b2 <chr>, b3 <lgl>
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   12.00   28.00   61.16   63.00  857.00
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8378    0
##       TRUE      0  425
## 
## Full matches between METAR observations and intervals
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## 
## Need Begin time
## # A tibble: 1 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-06-26 14:53:00 KGRR 261453Z 25005KT 1 1/2SM RA BR SCT049 BKN090 OVC110 2~
## 
## Need End time
## # A tibble: 1 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-07-24 15:53:00 KGRR 241553Z 14004KT 10SM TS FEW048CB BKN080 OVC110 22/20~
## 
## Extraneous Begin time
## # A tibble: 2 x 4
##   dtime               b1              b2    b3   
##   <dttm>              <chr>           <chr> <lgl>
## 1 2016-06-26 13:53:00 2016-06-26 1346 <NA>  NA   
## 2 2016-08-12 09:53:00 2016-08-12 0854 <NA>  NA   
## 
## Extraneous End time
## # A tibble: 1 x 4
##   dtime               e1              e2    e3   
##   <dttm>              <chr>           <chr> <lgl>
## 1 2016-04-07 12:53:00 2016-04-07 1215 <NA>  NA   
## 
## Wrong amount of begins or ends
## # A tibble: 0 x 7
## # ... with 7 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>, b1 <chr>,
## #   b2 <chr>, b3 <lgl>
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   17.00   40.50   94.02  111.75  901.00
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8208    0
##       TRUE      0  604
## 
## Full matches between METAR observations and intervals
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## 
## Need Begin time
## # A tibble: 2 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-08-14 00:53:00 KIAH 140053Z 00000KT 4SM TSRA BR SCT007 SCT020 BKN045CB O~
## 2 2016-08-14 22:53:00 KIAH 142253Z 09008KT 1/2SM R27/1600V2400FT -TSRA FEW006 S~
## 
## Need End time
## # A tibble: 1 x 2
##   dtime               origMETAR                                          
##   <dttm>              <chr>                                              
## 1 2016-12-04 10:53:00 KIAH 041053Z 32007KT 8SM OVC005 11/10 A2993 RMK AO2
## 
## Extraneous Begin time
## # A tibble: 1 x 4
##   dtime               b1              b2              b3   
##   <dttm>              <chr>           <chr>           <lgl>
## 1 2016-08-21 19:53:00 2016-08-21 1857 2016-08-21 1945 NA   
## 
## Extraneous End time
## # A tibble: 2 x 4
##   dtime               e1              e2    e3   
##   <dttm>              <chr>           <chr> <lgl>
## 1 2016-06-06 00:53:00 2016-06-06 0040 <NA>  NA   
## 2 2016-12-03 18:53:00 2016-12-03 1812 <NA>  NA   
## 
## Wrong amount of begins or ends
## # A tibble: 0 x 7
## # ... with 7 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>, b1 <chr>,
## #   b2 <chr>, b3 <lgl>
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -18.00   13.00   28.00   59.71   65.00  845.00 
## 
## Problem Detected - Intervals are not positive.  Data to help investigate
## 
## Vector of Begins
##  [1] "2016-08-19 18:36:00 UTC" "2016-08-20 18:04:00 UTC"
##  [3] "2016-08-20 19:15:00 UTC" "2016-08-21 07:52:00 UTC"
##  [5] "2016-08-21 14:55:00 UTC" "2016-08-21 19:45:00 UTC"
##  [7] "2016-08-22 16:26:00 UTC" "2016-08-23 20:57:00 UTC"
##  [9] "2016-08-24 20:02:00 UTC" "2016-08-25 18:12:00 UTC"
## [11] "2016-08-26 11:41:00 UTC"
## 
## Vector of Ends
##  [1] "2016-08-19 19:24:00 UTC" "2016-08-20 18:49:00 UTC"
##  [3] "2016-08-20 19:49:00 UTC" "2016-08-21 08:10:00 UTC"
##  [5] "2016-08-21 15:21:00 UTC" "2016-08-21 19:27:00 UTC"
##  [7] "2016-08-22 20:26:00 UTC" "2016-08-23 21:09:00 UTC"
##  [9] "2016-08-24 20:41:00 UTC" "2016-08-25 20:23:00 UTC"
## [11] "2016-08-26 12:26:00 UTC"
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8367    1
##       TRUE      0  448
## 
## Mismatch at time 2016-03-23 21:53 UTC
## [1] "KIAH 231953Z 17012G18KT 10SM BKN032 OVC050 23/17 A2986 RMK AO2 SLP110 T02280167"                                     
## [2] "KIAH 232053Z 16017G24KT 10SM SCT030 OVC048 22/18 A2984 RMK AO2 RAB02E12 SLP104 SCT V BKN P0000 60000 T02220178 56021"
## [3] "KIAH 232153Z 17011KT 10SM SCT029 OVC050 22/18 A2983 RMK AO2 RAB2058E13 SLP102 P0000 T02170178"                       
## [4] "KIAH 232253Z 16010KT 10SM SCT030 OVC055 21/18 A2982 RMK AO2 SLP098 RAB2145E16 P0000 T02060183"                       
## [5] "KIAH 232353Z 16014G19KT 10SM BKN018 OVC050 20/18 A2982 RMK AO2 SLP096 60000 T02000183 10244 20200 56008"             
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## 
## Need Begin time
## # A tibble: 1 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-07-18 17:54:00 KIND 181754Z 24013KT 10SM -TSRA BKN030CB OVC040 24/21 A30~
## 
## Need End time
## # A tibble: 3 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-04-07 23:54:00 KIND 072354Z 30011KT 10SM BKN037 BKN060 04/M01 A2980 RMK ~
## 2 2016-05-08 19:54:00 KIND 081954Z 19007KT 10SM FEW010 OVC080 14/09 A2995 RMK A~
## 3 2016-08-07 19:54:00 KIND 071954Z 29007KT 10SM SCT060 SCT250 31/16 A2999 RMK A~
## 
## Extraneous Begin time
## # A tibble: 1 x 4
##   dtime               b1              b2    b3   
##   <dttm>              <chr>           <chr> <lgl>
## 1 2016-09-05 21:54:00 2016-09-05 2101 <NA>  NA   
## 
## Extraneous End time
## # A tibble: 0 x 4
## # ... with 4 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>
## 
## Wrong amount of begins or ends
## # A tibble: 0 x 7
## # ... with 7 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>, b1 <chr>,
## #   b2 <chr>, b3 <lgl>
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   16.00   39.50   85.24   97.75  886.00
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8144    0
##       TRUE      0  623
## 
## Full matches between METAR observations and intervals
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## 
## Need Begin time
## # A tibble: 1 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-04-30 23:56:00 KLAS 302356Z 04026G31KT 6SM TSRA SCT035 BKN080CB OVC100 1~
## 
## Need End time
## # A tibble: 0 x 2
## # ... with 2 variables: dtime <dttm>, origMETAR <chr>
## 
## Extraneous Begin time
## # A tibble: 0 x 4
## # ... with 4 variables: dtime <dttm>, b1 <chr>, b2 <chr>, b3 <lgl>
## 
## Extraneous End time
## # A tibble: 0 x 4
## # ... with 4 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>
## 
## Wrong amount of begins or ends
## # A tibble: 0 x 7
## # ... with 7 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>, b1 <chr>,
## #   b2 <chr>, b3 <lgl>
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     6.0    16.0    29.0    52.1    57.5   502.0
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8710    0
##       TRUE      0  108
## 
## Full matches between METAR observations and intervals
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## 
## Need Begin time
## # A tibble: 2 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-08-29 22:54:00 KLNK 292254Z 35010KT 8SM -TSRA FEW070 BKN100 OVC120 23/20~
## 2 2016-09-16 06:54:00 KLNK 160654Z 36015G25KT 6SM RA BR FEW070 OVC110 16/15 A30~
## 
## Need End time
## # A tibble: 1 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-07-13 18:54:00 KLNK 131854Z 26005KT 10SM CLR 32/16 A2988 RMK AO2 SLP106 ~
## 
## Extraneous Begin time
## # A tibble: 0 x 4
## # ... with 4 variables: dtime <dttm>, b1 <chr>, b2 <chr>, b3 <lgl>
## 
## Extraneous End time
## # A tibble: 0 x 4
## # ... with 4 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>
## 
## Wrong amount of begins or ends
## # A tibble: 0 x 7
## # ... with 7 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>, b1 <chr>,
## #   b2 <chr>, b3 <lgl>
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   15.00   34.00   67.63   82.00  522.00
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8500    0
##       TRUE      0  313
## 
## Full matches between METAR observations and intervals
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## 
## Need Begin time
## # A tibble: 4 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-04-10 10:52:00 KMKE 101052Z 16017G26KT 8SM -PLRASN FEW046 BKN065 OVC075 ~
## 2 2016-06-10 03:52:00 KMKE 100352Z 20005KT 10SM -TSRA SCT070CB BKN090 18/16 A29~
## 3 2016-08-20 13:52:00 KMKE 201352Z 17010KT 10SM -RA FEW016 FEW035 BKN090 BKN150~
## 4 2016-10-07 02:52:00 KMKE 070252Z 18007KT 3SM VCTS -RA BR FEW030 BKN070 BKN095~
## 
## Need End time
## # A tibble: 0 x 2
## # ... with 2 variables: dtime <dttm>, origMETAR <chr>
## 
## Extraneous Begin time
## # A tibble: 0 x 4
## # ... with 4 variables: dtime <dttm>, b1 <chr>, b2 <chr>, b3 <lgl>
## 
## Extraneous End time
## # A tibble: 0 x 4
## # ... with 4 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>
## 
## Wrong amount of begins or ends
## # A tibble: 0 x 7
## # ... with 7 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>, b1 <chr>,
## #   b2 <chr>, b3 <lgl>
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     3.0    14.0    32.0    71.8    73.0  1335.0
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8346    0
##       TRUE      0  462
## 
## Full matches between METAR observations and intervals
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## 
## Need Begin time
## # A tibble: 0 x 2
## # ... with 2 variables: dtime <dttm>, origMETAR <chr>
## 
## Need End time
## # A tibble: 1 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-09-16 11:53:00 KMSN 161153Z 16011KT 10SM FEW090 SCT110 OVC150 18/16 A299~
## 
## Extraneous Begin time
## # A tibble: 0 x 4
## # ... with 4 variables: dtime <dttm>, b1 <chr>, b2 <chr>, b3 <lgl>
## 
## Extraneous End time
## # A tibble: 1 x 4
##   dtime               e1              e2    e3   
##   <dttm>              <chr>           <chr> <lgl>
## 1 2016-09-07 12:53:00 2016-09-07 1252 <NA>  NA   
## 
## Wrong amount of begins or ends
## # A tibble: 0 x 7
## # ... with 7 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>, b1 <chr>,
## #   b2 <chr>, b3 <lgl>
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   13.00   29.00   70.11   74.75  990.00
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8347    0
##       TRUE      0  451
## 
## Full matches between METAR observations and intervals
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## 
## Need Begin time
## # A tibble: 2 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-03-24 21:51:00 KORD 242151Z 34014KT 1 3/4SM -RASN BR SCT010 OVC016 01/00~
## 2 2016-08-29 19:51:00 KORD 291951Z 25009KT 1SM R10L/1200VP6000FT +TSRA BKN041CB~
## 
## Need End time
## # A tibble: 1 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-07-13 18:51:00 KORD 131851Z 22009G18KT 10SM FEW027 SCT047 BKN250 28/22 A~
## 
## Extraneous Begin time
## # A tibble: 0 x 4
## # ... with 4 variables: dtime <dttm>, b1 <chr>, b2 <chr>, b3 <lgl>
## 
## Extraneous End time
## # A tibble: 0 x 4
## # ... with 4 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>
## 
## Wrong amount of begins or ends
## # A tibble: 0 x 7
## # ... with 7 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>, b1 <chr>,
## #   b2 <chr>, b3 <lgl>
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   14.00   33.50   70.51   83.50 1243.00
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8349    0
##       TRUE      0  466
## 
## Full matches between METAR observations and intervals
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## 
## Need Begin time
## # A tibble: 0 x 2
## # ... with 2 variables: dtime <dttm>, origMETAR <chr>
## 
## Need End time
## # A tibble: 0 x 2
## # ... with 2 variables: dtime <dttm>, origMETAR <chr>
## 
## Extraneous Begin time
## # A tibble: 0 x 4
## # ... with 4 variables: dtime <dttm>, b1 <chr>, b2 <chr>, b3 <lgl>
## 
## Extraneous End time
## # A tibble: 0 x 4
## # ... with 4 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>
## 
## Wrong amount of begins or ends
## # A tibble: 0 x 7
## # ... with 7 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>, b1 <chr>,
## #   b2 <chr>, b3 <lgl>
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   12.00   21.00   55.53   51.75  638.00
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8612    0
##       TRUE      0  198
## 
## Full matches between METAR observations and intervals
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## 
## Need Begin time
## # A tibble: 0 x 2
## # ... with 2 variables: dtime <dttm>, origMETAR <chr>
## 
## Need End time
## # A tibble: 1 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-09-28 19:53:00 KTVC 281953Z 36010KT 10SM FEW027 FEW100 SCT250 19/11 A300~
## 
## Extraneous Begin time
## # A tibble: 1 x 4
##   dtime               b1              b2    b3   
##   <dttm>              <chr>           <chr> <lgl>
## 1 2016-06-06 21:53:00 2016-06-06 2117 <NA>  NA   
## 
## Extraneous End time
## # A tibble: 3 x 4
##   dtime               e1              e2    e3   
##   <dttm>              <chr>           <chr> <lgl>
## 1 2016-03-28 10:53:00 2016-03-28 0957 <NA>  NA   
## 2 2016-06-06 20:53:00 2016-06-06 2049 <NA>  NA   
## 3 2016-12-02 16:53:00 2016-12-02 1629 <NA>  NA   
## 
## Wrong amount of begins or ends
## # A tibble: 0 x 7
## # ... with 7 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>, b1 <chr>,
## #   b2 <chr>, b3 <lgl>
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   13.50   30.00   66.53   68.00  642.00
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8345    0
##       TRUE      0  469
## 
## Full matches between METAR observations and intervals

Initial outputs are encouraging:

  • In all locales, the constructed rain intervals align perfectly with the hourly METAR rain data (with the exception of Houston, TX on 2016-03-23 21:53)
  • In all locales, the minimum rain interval length is positive (with the exception of Houston, TX 2016-08-21 1945/1927)
  • In all locales, the maximum rain interval length is under 24 hours
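These interval properties can be spot-checked directly on the list returned by combinedPrecipitation(). A minimal sketch, assuming a hypothetical result object tmpIAH with the same testIntervals element as the tmpMSY/tmpMSP objects used below:

# 'tmpIAH' is a hypothetical result object; testIntervals holds interval lengths in minutes
ints <- as.numeric(tmpIAH$testIntervals)
all(ints > 0)        # every rain interval should be positive
all(ints < 24 * 60)  # and under 24 hours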

Further diagnostics of the two Houston issues:

# August 21, 2016 issue
kiah_2016 %>% 
    filter(lubridate::date(dtime)=="2016-08-21", lubridate::hour(dtime) %in% c(18, 19, 20, 21)) %>%
    pull(origMETAR)
## [1] "KIAH 211853Z 08005KT 10SM FEW018 FEW033 SCT090 BKN150 BKN180 BKN250 29/22 A3001 RMK AO2 SLP160 CB DSNT SW T02890222"         
## [2] "KIAH 211953Z 13009KT 10SM SCT029 BKN049 BKN100 BKN150 BKN250 27/24 A2999 RMK AO2 RAB1857E27B45RAEMM SLP152 P0000 T02720244 $"
## [3] "KIAH 212053Z VRB05KT 10SM FEW017 SCT029 SCT050 BKN160 BKN250 30/24 A2998 RMK AO2 SLP150 60000 T03000244 58012"               
## [4] "KIAH 212153Z 12009KT 10SM BKN023 BKN160 BKN250 29/24 A2997 RMK AO2 SLP148 BKN023 V SCT T02940239"
# March 23, 2016 issue
kiah_2016 %>% 
    filter(lubridate::date(dtime)=="2016-03-23", lubridate::hour(dtime) %in% c(20, 21, 22, 23)) %>%
    pull(origMETAR)
## [1] "KIAH 232053Z 16017G24KT 10SM SCT030 OVC048 22/18 A2984 RMK AO2 RAB02E12 SLP104 SCT V BKN P0000 60000 T02220178 56021"
## [2] "KIAH 232153Z 17011KT 10SM SCT029 OVC050 22/18 A2983 RMK AO2 RAB2058E13 SLP102 P0000 T02170178"                       
## [3] "KIAH 232253Z 16010KT 10SM SCT030 OVC055 21/18 A2982 RMK AO2 SLP098 RAB2145E16 P0000 T02060183"                       
## [4] "KIAH 232353Z 16014G19KT 10SM BKN018 OVC050 20/18 A2982 RMK AO2 SLP096 60000 T02000183 10244 20200 56008"

The August 21 situation highlights an area where the guard logic needs improvement:

  • 1953Z … RAB1857E27B45RAEMM requires either 1) replacing RAEMM with RAE53 (treat as a needEnd), or 2) removing the B45 that is paired with the EMM (treat as an overBegin)
  • The current guard logic sees this as 2 begins and 1 end and treats it as an overBegin (correct so far); because an end time is present, it simply deletes the first begin time to equalize the counts
  • This guard-logic error is likely of very minor consequence: it affects under 60 minutes of data across 15 cities over the course of a year
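The 2-begins/1-end reading can be reproduced with the same str_extract_all() patterns the guard logic relies on; a minimal illustration:

library(stringr)
rmk <- "RAB1857E27B45RAEMM"         # remark group from the 1953Z METAR
str_extract_all(rmk, "B\\d+")[[1]]  # "B1857" "B45" -> 2 begins
str_extract_all(rmk, "E\\d+")[[1]]  # "E27" -> 1 end (the EMM carries no digits)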

The March 23 situation highlights an error in the METAR itself:

  • 2253Z … RAB2145E16 records rain during a period that belongs to the 2153Z METAR (likely a transposition typo, since RAB2154E16 would be appropriate in the 2253Z METAR)
  • The guard logic should be updated to allow only four-digit times that ‘belong’ to the current METAR, issuing a warning and automatically converting out-of-range times into range
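A minimal sketch of that ‘belongs’ test, mirroring the minute comparison implemented in getBeginEndTimeMatrix() below (the hourly window runs up to the observation minute, so an in-range four-digit time must carry a later minute):

library(stringr)
obsMin  <- 53       # minute of the 2253Z observation
rawTime <- "B2145"  # four-digit begin time found in its remarks
as.integer(str_sub(rawTime, -2)) - obsMin > 0  # FALSE, so it belongs to an earlier METAR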

The New Orleans data are run stand-alone since they surface a few more issues:

tmpMSY <- combinedPrecipitation(kmsy_2016, pType="(?<!FZ)RA", frameName="kmsy_2016")
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## 
## Need Begin time
## # A tibble: 0 x 2
## # ... with 2 variables: dtime <dttm>, origMETAR <chr>
## 
## Need End time
## # A tibble: 1 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-03-27 16:53:00 KMSY 271653Z 12010KT 10SM BKN008 BKN030CB BKN060 OVC130 2~
## 
## Extraneous Begin time
## # A tibble: 2 x 4
##   dtime               b1              b2    b3   
##   <dttm>              <chr>           <chr> <lgl>
## 1 2016-03-11 20:53:00 2016-03-11 2007 <NA>  NA   
## 2 2016-08-14 23:53:00 2016-08-14 2339 <NA>  NA   
## 
## Extraneous End time
## # A tibble: 2 x 4
##   dtime               e1              e2    e3   
##   <dttm>              <chr>           <chr> <lgl>
## 1 2016-12-13 23:53:00 2016-12-13 2258 <NA>  NA   
## 2 2017-01-01 09:53:00 2017-01-01 0852 <NA>  NA   
## 
## Wrong amount of begins or ends
## # A tibble: 0 x 7
## # ... with 7 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>, b1 <chr>,
## #   b2 <chr>, b3 <lgl>
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*)
## Warning in unclass(time1) - unclass(time2): longer object length is not a
## multiple of shorter object length
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -528234      15      30   -1142      66     738 
## 
## Problem Detected - Intervals are not positive.  Data to help investigate
## 
## Vector of Begins
## [1] "2017-01-01 13:06:00 UTC" "2017-01-01 17:04:00 UTC"
## [3] "2017-01-01 18:41:00 UTC" "2017-01-01 19:51:00 UTC"
## [5] "2017-01-01 20:15:00 UTC"
## 
## Vector of Ends
## [1] "2017-01-01 13:48:00 UTC" "2017-01-01 17:23:00 UTC"
## [3] "2017-01-01 18:54:00 UTC" "2017-01-01 19:55:00 UTC"
## [5] "2017-01-01 20:49:00 UTC"
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8375    0
##       TRUE      3  435
## 
## Mismatch at time 2017-01-01 21:53 UTC
## [1] "KMSY 011953Z 20015G21KT 10SM +RA FEW022 SCT046 BKN065 OVC130 24/22 A2991 RMK AO2 LTG DSNT N RAE1854B51 SLP132 P0000 T02440222 $"                       
## [2] "KMSY 012053Z 28007KT 9SM OVC024 19/19 A2993 RMK AO2 WSHFT 1957 LTG DSNT NE AND W RAE1955B15E49 SLP140 CB DSNT N P0007 60013 T01940189 55004 $"         
## [3] "KMSY 012153Z 29007KT 5SM -TSRA BR BKN028CB OVC035 19/19 A2995 RMK AO2 LTG DSNT W RAB2055 TSB31 SLP145 OCNL LTGICCG W-N TS W-N MOV NE P0004 T01890189 $"
## [4] "KMSY 012253Z 29007KT 5SM RA BR BKN032CB BKN041 OVC050 18/18 A2995 RMK AO2 LTG DSNT SE-W TSE06 SLP146 CB ALQDS MOV NE P0019 T01830183 $"                
## [5] "KMSY 012353Z 22003KT 3SM RA BR FEW034 SCT041 OVC055 18/18 A2998 RMK AO2 LTG DSNT E SLP155 CB ALQDS MOV NE P0028 60064 T01780178 10244 20178 53016 $"   
## 
## Mismatch at time 2017-01-01 22:53 UTC
## [1] "KMSY 012053Z 28007KT 9SM OVC024 19/19 A2993 RMK AO2 WSHFT 1957 LTG DSNT NE AND W RAE1955B15E49 SLP140 CB DSNT N P0007 60013 T01940189 55004 $"         
## [2] "KMSY 012153Z 29007KT 5SM -TSRA BR BKN028CB OVC035 19/19 A2995 RMK AO2 LTG DSNT W RAB2055 TSB31 SLP145 OCNL LTGICCG W-N TS W-N MOV NE P0004 T01890189 $"
## [3] "KMSY 012253Z 29007KT 5SM RA BR BKN032CB BKN041 OVC050 18/18 A2995 RMK AO2 LTG DSNT SE-W TSE06 SLP146 CB ALQDS MOV NE P0019 T01830183 $"                
## [4] "KMSY 012353Z 22003KT 3SM RA BR FEW034 SCT041 OVC055 18/18 A2998 RMK AO2 LTG DSNT E SLP155 CB ALQDS MOV NE P0028 60064 T01780178 10244 20178 53016 $"   
## 
## Mismatch at time 2017-01-01 23:53 UTC
## [1] "KMSY 012153Z 29007KT 5SM -TSRA BR BKN028CB OVC035 19/19 A2995 RMK AO2 LTG DSNT W RAB2055 TSB31 SLP145 OCNL LTGICCG W-N TS W-N MOV NE P0004 T01890189 $"
## [2] "KMSY 012253Z 29007KT 5SM RA BR BKN032CB BKN041 OVC050 18/18 A2995 RMK AO2 LTG DSNT SE-W TSE06 SLP146 CB ALQDS MOV NE P0019 T01830183 $"                
## [3] "KMSY 012353Z 22003KT 3SM RA BR FEW034 SCT041 OVC055 18/18 A2998 RMK AO2 LTG DSNT E SLP155 CB ALQDS MOV NE P0028 60064 T01780178 10244 20178 53016 $"

New Orleans is interesting in that rain is falling through the final hours of the last day of the pulled METAR data. This (correctly) leaves no final RAE in the METAR, so the logic should be updated to place an artificial RAE at the last minute of the file. Without it, the begin and end vectors differ in length, R recycles the shorter one, and an interval comes out near -365 days. Similar logic is needed for the opposite boundary, when the file opens mid-precipitation and the first remark is an unmatched end: an artificial RAB should be created at the beginning.
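A minimal sketch of the proposed fix, using illustrative values only: when the file ends mid-rain, append its final timestamp as an artificial end so the begin and end vectors have equal lengths and nothing recycles.

# Illustrative values; 'lastObs' stands in for the final minute of the file
begins  <- as.POSIXct(c("2017-01-01 19:51", "2017-01-01 20:15"), tz="UTC")
ends    <- as.POSIXct("2017-01-01 19:55", tz="UTC")
lastObs <- as.POSIXct("2017-01-01 20:49", tz="UTC")
if (length(ends) < length(begins)) ends <- c(ends, lastObs)  # artificial RAE
ends - begins  # equal lengths, so every interval stays positive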

The Minneapolis data are run stand-alone since they surface many issues:

tmpMSP <- combinedPrecipitation(kmsp_2016, pType="(?<!FZ)RA", frameName="kmsp_2016", maxConsistencyPrint = 5)
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## 
## Need Begin time
## # A tibble: 1 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-09-06 07:53:00 KMSP 060753Z 22007G17KT 6SM -TSRA FEW023 BKN049CB OVC095 ~
## 
## Need End time
## # A tibble: 1 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-11-22 13:53:00 KMSP 221353Z 13018KT 10SM -PL OVC055 01/M06 A3015 RMK AO2~
## 
## Extraneous Begin time
## # A tibble: 4 x 4
##   dtime               b1              b2              b3   
##   <dttm>              <chr>           <chr>           <lgl>
## 1 2016-05-29 19:53:00 2016-05-29 1931 <NA>            NA   
## 2 2016-06-20 04:53:00 2016-06-20 0411 <NA>            NA   
## 3 2016-09-06 08:53:00 2016-09-06 0744 2016-09-06 0817 NA   
## 4 2016-11-29 07:53:00 2016-11-29 0718 <NA>            NA   
## 
## Extraneous End time
## # A tibble: 2 x 4
##   dtime               e1              e2    e3   
##   <dttm>              <chr>           <chr> <lgl>
## 1 2016-04-05 15:53:00 2016-04-05 1522 <NA>  NA   
## 2 2016-10-18 04:53:00 2016-10-18 0435 <NA>  NA   
## 
## Wrong amount of begins or ends
## # A tibble: 0 x 7
## # ... with 7 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>, b1 <chr>,
## #   b2 <chr>, b3 <lgl>
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*)
## Warning in unclass(time1) - unclass(time2): longer object length is not a
## multiple of shorter object length
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -476532.0      18.0      67.5    -818.0     188.0   28458.0 
## 
## Problem Detected - Intervals are not positive.  Data to help investigate
## 
## Vector of Begins
## [1] "2016-12-05 18:06:00 UTC" "2016-12-05 18:50:00 UTC"
## [3] "2016-12-06 00:35:00 UTC" "2016-12-23 17:02:00 UTC"
## [5] "2016-12-25 17:59:00 UTC"
## 
## Vector of Ends
## [1] "2016-12-05 18:59:00 UTC" "2016-12-06 00:50:00 UTC"
## [3] "2016-12-23 17:45:00 UTC" "2016-12-25 22:35:00 UTC"
## [5] "2016-12-26 05:07:00 UTC"
## 
## 
## Potential problem Detected - very long.  Data to help investigate
## 
## Positions with problems are: 260 266 268 269 270 271 278 280 288 289 294 298 303 304 312 314 319 321 327 328 350 353 354
## Vector of Begins
##  [1] "2016-09-07 01:55:00 UTC" "2016-09-07 02:22:00 UTC"
##  [3] "2016-09-07 12:44:00 UTC" "2016-09-07 14:12:00 UTC"
##  [5] "2016-09-07 15:08:00 UTC" "2016-09-07 17:02:00 UTC"
##  [7] "2016-09-09 00:52:00 UTC" "2016-09-09 05:36:00 UTC"
##  [9] "2016-09-09 07:51:00 UTC" "2016-09-09 11:49:00 UTC"
## [11] "2016-09-09 15:33:00 UTC"
## 
## Vector of Ends
##  [1] "2016-09-07 02:57:00 UTC" "2016-09-07 13:19:00 UTC"
##  [3] "2016-09-07 14:57:00 UTC" "2016-09-07 15:44:00 UTC"
##  [5] "2016-09-07 17:11:00 UTC" "2016-09-09 01:01:00 UTC"
##  [7] "2016-09-09 05:53:00 UTC" "2016-09-09 09:31:00 UTC"
##  [9] "2016-09-09 12:39:00 UTC" "2016-09-09 15:57:00 UTC"
## [11] "2016-09-10 06:10:00 UTC"
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  5875 2515
##       TRUE      0  427
## 
## Mismatch at time 2016-09-06 08:53 UTC
## [1] "KMSP 060653Z 21010G17KT 10SM BKN050 BKN060 OVC090 27/22 A2983 RMK AO2 SLP097 T02720217"                                                                                             
## [2] "KMSP 060753Z 22007G17KT 6SM -TSRA FEW023 BKN049CB OVC095 26/22 A2982 RMK AO2 PK WND 22027/0742 LTG DSNT NE AND S SLP092 TSB35 OCNL LTGICCG OHD TS OHD MOV NE P0015 T02560222"       
## [3] "KMSP 060853Z 24003KT 8SM TS BKN049CB BKN060 OVC080 25/22 A2981 RMK AO2 LTG DSNT ALQDS SLP088 RAB0744E01B17 TSB0735E12B22 OCNL LTGICCG OHD TS OHD MOV NE P0001 60016 T02500222 58008"
## [4] "KMSP 060953Z 25007KT 4SM -TSRA BR SCT027CB BKN041 OVC080 24/22 A2981 RMK AO2 LTG DSNT ALQDS RAB34 TSE01B13E28B38 SLP089 FRQ LTGICCG OHD TS OHD MOV NE P0020 T02390222"              
## [5] "KMSP 061053Z 18003KT 5SM -TSRA BR FEW029 BKN045CB OVC065 24/22 A2981 RMK AO2 LTG DSNT ALQDS RAE01B14E28B46 TSE02B12E36B41 SLP088 OCNL LTGICCG N-E VCTS N-E MOV NE P0005 T02440222"  
## 
## Mismatch at time 2016-09-06 14:53 UTC
## [1] "KMSP 061253Z 00000KT 9SM -RA FEW007 SCT015 BKN100 OVC150 22/21 A2989 RMK AO2 LTG DSNT SE RAE12B37 SLP116 P0004 T02220206"     
## [2] "KMSP 061353Z 21007KT 8SM -RA FEW010 OVC090 22/21 A2996 RMK AO2 RAE00B21 SLP140 P0007 T02220206"                               
## [3] "KMSP 061453Z 08007KT 10SM FEW005 FEW110 BKN150 BKN250 22/21 A2989 RMK AO2 LTG DSNT W RAE47 SLP116 P0006 60017 T02220206 58006"
## [4] "KMSP 061553Z VRB05KT 10SM -RA SCT009 SCT100 BKN120 BKN150 22/21 A2992 RMK AO2 LTG DSNT SW RAB13 SLP125 P0003 T02220206"       
## [5] "KMSP 061653Z 15004KT 10SM FEW110 BKN150 BKN250 22/19 A2992 RMK AO2 LTG DSNT E RAE40 SLP127 P0002 T02220194"                   
## 
## Mismatch at time 2016-09-06 16:53 UTC
## [1] "KMSP 061453Z 08007KT 10SM FEW005 FEW110 BKN150 BKN250 22/21 A2989 RMK AO2 LTG DSNT W RAE47 SLP116 P0006 60017 T02220206 58006"       
## [2] "KMSP 061553Z VRB05KT 10SM -RA SCT009 SCT100 BKN120 BKN150 22/21 A2992 RMK AO2 LTG DSNT SW RAB13 SLP125 P0003 T02220206"              
## [3] "KMSP 061653Z 15004KT 10SM FEW110 BKN150 BKN250 22/19 A2992 RMK AO2 LTG DSNT E RAE40 SLP127 P0002 T02220194"                          
## [4] "KMSP 061753Z 20004KT 10SM FEW010 FEW150 BKN180 BKN250 22/19 A2994 RMK AO2 LTG DSNT SE AND S SLP132 60022 T02220194 10228 20217 51016"
## [5] "KMSP 061853Z 21004KT 10SM SCT011 BKN033 BKN150 BKN250 24/20 A2995 RMK AO2 SLP137 T02390200"                                          
## 
## Mismatch at time 2016-09-06 17:53 UTC
## [1] "KMSP 061553Z VRB05KT 10SM -RA SCT009 SCT100 BKN120 BKN150 22/21 A2992 RMK AO2 LTG DSNT SW RAB13 SLP125 P0003 T02220206"              
## [2] "KMSP 061653Z 15004KT 10SM FEW110 BKN150 BKN250 22/19 A2992 RMK AO2 LTG DSNT E RAE40 SLP127 P0002 T02220194"                          
## [3] "KMSP 061753Z 20004KT 10SM FEW010 FEW150 BKN180 BKN250 22/19 A2994 RMK AO2 LTG DSNT SE AND S SLP132 60022 T02220194 10228 20217 51016"
## [4] "KMSP 061853Z 21004KT 10SM SCT011 BKN033 BKN150 BKN250 24/20 A2995 RMK AO2 SLP137 T02390200"                                          
## [5] "KMSP 061953Z 23005KT 10SM FEW011 BKN036 BKN070 BKN250 24/20 A2996 RMK AO2 RAB20E49 SLP140 P0000 T02440200"                           
## 
## Mismatch at time 2016-09-06 18:53 UTC
## [1] "KMSP 061653Z 15004KT 10SM FEW110 BKN150 BKN250 22/19 A2992 RMK AO2 LTG DSNT E RAE40 SLP127 P0002 T02220194"                          
## [2] "KMSP 061753Z 20004KT 10SM FEW010 FEW150 BKN180 BKN250 22/19 A2994 RMK AO2 LTG DSNT SE AND S SLP132 60022 T02220194 10228 20217 51016"
## [3] "KMSP 061853Z 21004KT 10SM SCT011 BKN033 BKN150 BKN250 24/20 A2995 RMK AO2 SLP137 T02390200"                                          
## [4] "KMSP 061953Z 23005KT 10SM FEW011 BKN036 BKN070 BKN250 24/20 A2996 RMK AO2 RAB20E49 SLP140 P0000 T02440200"                           
## [5] "KMSP 062053Z 19003KT 10SM FEW018 BKN041 BKN130 BKN250 25/19 A2994 RMK AO2 LTG DSNT SE AND S SLP134 60000 T02500189 50002"
testMSP <- tibble::tibble(beginTime=tmpMSP$testAllBegins, 
                          endTime=c(tmpMSP$testAllEnds, NA),  # make the same length as begins
                          ints=tmpMSP$testIntervals
                          )

testMSP %>%
    filter(ints >= 2000 | lubridate::date(beginTime) == "2016-09-06") %>%
    as.data.frame()
##              beginTime             endTime       ints
## 1  2016-09-06 07:53:00 2016-09-06 08:01:00     8 mins
## 2  2016-09-06 08:17:00 2016-09-06 10:01:00   104 mins
## 3  2016-09-06 09:34:00 2016-09-06 10:28:00    54 mins
## 4  2016-09-06 10:14:00 2016-09-06 11:01:00    47 mins
## 5  2016-09-06 10:46:00 2016-09-06 12:12:00    86 mins
## 6  2016-09-06 11:16:00 2016-09-06 13:00:00   104 mins
## 7  2016-09-06 12:37:00 2016-09-06 14:47:00   130 mins
## 8  2016-09-06 13:21:00 2016-09-06 16:40:00   199 mins
## 9  2016-09-06 15:13:00 2016-09-06 19:49:00   276 mins
## 10 2016-09-06 19:20:00 2016-09-07 01:25:00   365 mins
## 11 2016-09-10 05:46:00 2016-09-13 23:51:00  5405 mins
## 12 2016-09-14 01:09:00 2016-09-16 02:19:00  2950 mins
## 13 2016-09-17 01:07:00 2016-09-19 16:05:00  3778 mins
## 14 2016-09-19 14:58:00 2016-09-21 05:04:00  2286 mins
## 15 2016-09-22 11:01:00 2016-09-23 23:46:00  2205 mins
## 16 2016-09-25 22:14:00 2016-09-28 00:06:00  2992 mins
## 17 2016-09-27 21:33:00 2016-10-04 15:07:00  9694 mins
## 18 2016-10-05 01:09:00 2016-10-06 22:52:00  2743 mins
## 19 2016-10-07 06:46:00 2016-10-12 04:37:00  7071 mins
## 20 2016-10-12 10:59:00 2016-10-15 22:57:00  5038 mins
## 21 2016-10-18 05:48:00 2016-10-26 13:17:00 11969 mins
## 22 2016-10-26 16:18:00 2016-10-29 17:11:00  4373 mins
## 23 2016-10-29 22:48:00 2016-11-18 17:06:00 28458 mins
## 24 2016-11-18 18:13:00 2016-11-22 13:53:00  5500 mins
## 25 2016-11-22 22:53:00 2016-11-25 00:21:00  2968 mins
## 26 2016-11-24 22:49:00 2016-11-28 05:16:00  4707 mins
## 27 2016-12-01 07:20:00 2016-12-05 18:28:00  6428 mins
## 28 2016-12-06 00:35:00 2016-12-23 17:45:00 25510 mins
## 29 2016-12-23 17:02:00 2016-12-25 22:35:00  3213 mins

The Minneapolis data appear to first become misaligned during a September 6 rain event:

  • 2016-09-06 0653 no rain
  • 2016-09-06 0753 -TSRA TSB35 (the algorithm correctly flags 0753 as “needs begin”)
  • 2016-09-06 0853 TS RAB0744E01B17 (the algorithm correctly flags this as an “over-begin”, but the guard logic needs to delete both begin times, since the net change for -TSRA becoming TS should be one end)
  • Every subsequent record is then erroneous (including overlapping intervals) because the extra begin time is kept
  • Additionally, the RAB0744 belongs in the 0753 observation (the same issue as in Houston)
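The overlap symptom can be screened for directly in testMSP (assuming begins and ends are each in chronological order, an interval should not begin before the previous one ends):

# Positions where an interval starts before the prior interval has ended
with(testMSP, which(tail(beginTime, -1) < head(endTime, -1)))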

Next steps are to investigate and fix the following:

  • Remove any RAB/RAE observations that do not “belong” to the hour covered by the current METAR
  • Check whether overBegin or overEnd can have 2+ occurrences, and upgrade the logic to delete all of them when needed
  • Select the correct RAB to discard when RABnnEnnBnn is observed where there should be no net change (and the same guard issue for RAE)
  • Upgrade the logic to manage a start state of precipitation (create an RAB at the beginning so the interval overlaps) and an end state of precipitation (create an RAE at the end to avoid recycling vectors)

The functions that need updating are suggestBeginEndTimes, convertExceptionsToTimes, and getBeginEndTimeMatrix:

# Updated functions to account for:
# 1. Remove any RAB/RAE observations whose four-digit (dddd) time 'belongs' to a different hourly METAR
# 2. Check whether overBegin or overEnd have 2+ occurrences, and delete all that are necessary
# 3. Upgrade logic to manage a starting state of precipitation, or an ending state of precipitation
# 4. Select the correct RAB/RAE to discard in cases like RABnnEnnBnn where the net change should be 0


# Helper function to extract the beginning and ending times using str_extract_all
# Updated to check that four-digit times are within the proper hour
getBeginEndTimeMatrix <- function(file, 
                                  pullVar="precipData", 
                                  pState="B", 
                                  timeVar=NULL
                                  ) {
    
    # FUNCTION ARGUMENTS:
    # file: file containing data
    # pullVar: the variable of interest in file
    # pState: the precipitation state of interest - B (begin) or E (end)
    # timeVar: the datetime variable in file which can be used to check consistency (NULL means no checks)
    
    # The basic extraction process
    mtxBET <- file %>%
        pull(pullVar) %>%
        str_extract_all(paste0(pState, "\\d+"), simplify=TRUE) %>%
        as.data.frame(stringsAsFactors=FALSE) %>%
        tibble::as_tibble()
    
    # If time variable has been passed, check for consistency of extracted times
    if (!is.null(timeVar)) {
        chk <- mtxBET %>%
            mutate(time=pull(file, timeVar)) %>%
            pivot_longer(-time, names_to="column", values_to="value") %>%
            mutate(convVal=str_replace(value, pattern=pState, replacement=""),  # strip the B/E prefix
                   keyMin=lubridate::minute(time), 
                   delta=ifelse(str_length(convVal)==4, as.integer(str_sub(convVal, 3, -1))-keyMin, NA)
                   )
        
        issues <- chk %>%
            filter(delta <= 0)
        
        if (nrow(issues) > 0) {
            
            # Flag the issue
            cat("\nMETAR has begin or end time belonging to a different record - will be deleted\n")
            print(issues)
            
            # Delete the issue
            chk <- chk %>%
                filter(is.na(delta) | delta > 0)
            
            # Fix the V1-Vn due to deletions
            chk <- chk %>%
                group_by(time) %>%
                mutate(column=paste0("V", row_number())) %>%
                ungroup()
            
            # Update mtxBET
            mtxBET <- chk %>%
                select(time, column, value) %>%
                pivot_wider(names_from=column, values_from=value) %>%
                select(-time)
        }
    }
    
    # Return the matrix (converted to data frame)
    mtxBET
    
}


# Update to pull any cases of 2+ overBegin or 2+ overEnd
convertExceptionsToTimes <- function(excFile, maxCol=3) {
    
    # Basic beginAdd (just add the METAR time)
    beginAdd <- excFile %>%
        filter(needBegin) %>%
        pull(dtime) %>%
        timeFormatter()
    
    # Basic endAdd (just add the METAR time)
    endAdd <- excFile %>%
        filter(needEnd) %>%
        pull(dtime) %>%
        timeFormatter()
    
    # Basic endExclude (delete the earliest nOverEnd end times - maybe add a guard check later)
    endExclude <- excFile %>%
        filter(overEnd) %>%
        select_at(vars(all_of(c("dtime", "nOverEnd", paste0("e", 1:maxCol))))) %>%
        pivot_longer(-c(dtime, nOverEnd), names_to="column", values_to="time") %>%
        filter(as.integer(str_replace(column, pattern="e", replacement="")) <= nOverEnd) %>%
        pull(time)

    # Basic beginExclude (delete the earliest nOverBegin begin times - maybe add a guard check later)
    beginExclude <- excFile %>%
        filter(overBegin) %>%
        select_at(vars(all_of(c("dtime", "nOverBegin", paste0("b", 1:maxCol))))) %>%
        pivot_longer(-c(dtime, nOverBegin), names_to="column", values_to="time") %>%
        filter(as.integer(str_replace(column, pattern="b", replacement="")) <= nOverBegin) %>%
        pull(time)
    
    list(endAdd=endAdd, beginAdd=beginAdd, endExclude=endExclude, beginExclude=beginExclude)
    
}



# Updated to pull out the number of overBegin and overEnd
suggestBeginEndTimes <- function(df, 
                                 regMatch, 
                                 nColMax=3, 
                                 printNACols=FALSE, 
                                 printIssues=TRUE
                                 ) {
    
    # FUNCTION ARGUMENTS
    # df: the data frame to be examined
    # regMatch: the relevant regex extraction code
    # nColMax: the maximum number of columns allowed for precipitation begins and precipitation ends
    # printNACols: boolean, whether to print the NA column sums
    # printIssues: boolean, whether to print out all of the issues
    
    # Pull the data and check for the specified precipitation pattern and lags
    # Needs to follow the \\d{6}Z that shows the time and precede the RMK that denotes the remarks
    sugStates <- df %>%
        select(dtime, origMETAR) %>%
        mutate(curPrecip=str_detect(origMETAR, paste0("\\d{6}Z.*", regMatch, ".*RMK")), 
               lagPrecip=lag(curPrecip, 1)
        )
    
    # Use the analysis data to look for begins and ends flagged in the remarks
    sugBE <- extractPrecipData(sugStates, pType=regMatch)
    
    # Inner join the data by dtime
    sugBEJoin <- sugBE %>% 
        select(dtime, precipData, chgPrecip=isPrecip, dateUTC, hourUTC)
    sugStates <- sugStates %>%
        inner_join(sugBEJoin, by="dtime")
    
    # Get the beginning and end times data for the desired precipitation type
    sugBegin <- getBeginEndTimeMatrix(sugStates, pState="B", timeVar="dtime")
    sugEnd <- getBeginEndTimeMatrix(sugStates, pState="E", timeVar="dtime")
    
    # Ensure that user-defined maximum number of columns is not exceeded
    if (ncol(sugBegin) > nColMax | ncol(sugEnd) > nColMax) { 
        stop("Function argument set to allow only for 0 - ", nColMax, " columns; investigate and fix")
    }
    
    # For each iteration, create testBTn and testETn (use NA if no column)
    for (ctr in 1:nColMax) {
        
        # Create NA as the baseline
        assign(paste0("testBT", ctr), NA)
        assign(paste0("testET", ctr), NA)
        
        # Assign begin if exists
        if (ncol(sugBegin) >= ctr) {
            bt <- getBeginEndTimeVector(sugBegin, sugStates, extractVar=paste0("V", ctr), extractSym="B")
            assign(paste0("testBT", ctr), bt)
        }
        
        # Assign end if exists
        if (ncol(sugEnd) >= ctr) {
            et <- getBeginEndTimeVector(sugEnd, sugStates, extractVar=paste0("V", ctr), extractSym="E")
            assign(paste0("testET", ctr), et)
        }
        
    }
    
    # Integrate to a single file, allowing for nMaxCol applications of variables
    # Create file sugExceptions as sugStates and initialize the columns for begins and ends
    sugExceptions <- sugStates %>%
        mutate(begins=0, ends=0)
    
    # Loop through up to nColMax, adding the appropriate bn=testBTn and en=testETn
    for (ctr in 1:nColMax) {
        sugExceptions <- sugExceptions %>%
            mutate(!!paste0("b", ctr) := get(paste0("testBT", ctr)), 
                   !!paste0("e", ctr) := get(paste0("testET", ctr)), 
                   begins=begins + ifelse(is.na(get(paste0("b", ctr))), 0, 1),
                   ends=ends + ifelse(is.na(get(paste0("e", ctr))), 0, 1)
            )
    }
    
    # Create the final version by checking key criteria
    # Add a criterion for b before e if previously precipitating, and e before b if not?
    # Potential issues include:
    # Current and lagged state are the same when begins != ends
    # Current is no, lagged is yes, ends <= begins
    # Current is yes, lagged is no, begins <= ends
    sugExceptions <- sugExceptions %>%
        mutate(issueA=(lagPrecip==curPrecip & begins!=ends), # no state change should have begins==ends
               issueB=(lagPrecip & !curPrecip & ends <= begins),  # change to 'no' should have ends>begins
               issueC=(curPrecip & !lagPrecip & begins <= ends),  # change to 'yes' should have begins>ends
               issueD=abs(begins-ends) > 1,  # state is a 1/0 so net change should be -1, 0, or 1
               netBeginEndShouldBe=(curPrecip - lagPrecip),
               netBeginEndIs=(begins - ends),
               needBegin=issueC & begins==0,
               needEnd=issueB & ends==0,
               overBegin=(issueA & (begins > ends)) | (issueB & (ends > 0)),
               overEnd=(issueA & (ends > begins)) | (issueC & (begins > 0)),
               nOverBegin=pmax(0, netBeginEndIs-netBeginEndShouldBe-needEnd),
               nOverEnd=pmax(0, netBeginEndShouldBe-netBeginEndIs-needBegin),
               tooManyBE=issueD
        )
    
    # Report on the NA status by column
    if (printNACols) {
        colSums(is.na(sugExceptions)) %>% 
            print()
    }
    
    # Print out the issue if requested
    if (printIssues) {
        printBeginEndIssues(sugExceptions, nColMax=nColMax)
    }
    
    sugExceptions %>%
        filter(issueA | issueB | issueC | issueD)
    
}

With the logic updated to correct for times belonging to the wrong METAR, as well as for 2+ occurrences of overBegin or overEnd, the process is re-run for Minneapolis:

tmpMSP2 <- combinedPrecipitation(kmsp_2016, 
                                 pType="(?<!FZ)RA",
                                 frameName="kmsp_2016", 
                                 maxConsistencyPrint = 5
                                 )
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## 
## METAR has begin or end time belonging to a different record - will be deleted
## # A tibble: 1 x 6
##   time                column value convVal keyMin delta
##   <dttm>              <chr>  <chr> <chr>    <int> <int>
## 1 2016-09-06 08:53:00 V1     B0744 0744        53    -9
## 
## Need Begin time
## # A tibble: 1 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-09-06 07:53:00 KMSP 060753Z 22007G17KT 6SM -TSRA FEW023 BKN049CB OVC095 ~
## 
## Need End time
## # A tibble: 1 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-11-22 13:53:00 KMSP 221353Z 13018KT 10SM -PL OVC055 01/M06 A3015 RMK AO2~
## 
## Extraneous Begin time
## # A tibble: 4 x 4
##   dtime               b1              b2    b3   
##   <dttm>              <chr>           <chr> <lgl>
## 1 2016-05-29 19:53:00 2016-05-29 1931 <NA>  NA   
## 2 2016-06-20 04:53:00 2016-06-20 0411 <NA>  NA   
## 3 2016-09-06 08:53:00 2016-09-06 0817 <NA>  NA   
## 4 2016-11-29 07:53:00 2016-11-29 0718 <NA>  NA   
## 
## Extraneous End time
## # A tibble: 2 x 4
##   dtime               e1              e2    e3   
##   <dttm>              <chr>           <chr> <lgl>
## 1 2016-04-05 15:53:00 2016-04-05 1522 <NA>  NA   
## 2 2016-10-18 04:53:00 2016-10-18 0435 <NA>  NA   
## 
## Wrong amount of begins or ends
## # A tibble: 0 x 7
## # ... with 7 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>, b1 <chr>,
## #   b2 <chr>, b3 <lgl>
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*)
## Warning in unclass(time1) - unclass(time2): longer object length is not a
## multiple of shorter object length
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -476532.0      18.0      67.5    -817.9     188.0   28458.0 
## 
## Problem Detected - Intervals are not positive.  Data to help investigate
## 
## Vector of Begins
## [1] "2016-12-05 18:06:00 UTC" "2016-12-05 18:50:00 UTC"
## [3] "2016-12-06 00:35:00 UTC" "2016-12-23 17:02:00 UTC"
## [5] "2016-12-25 17:59:00 UTC"
## 
## Vector of Ends
## [1] "2016-12-05 18:59:00 UTC" "2016-12-06 00:50:00 UTC"
## [3] "2016-12-23 17:45:00 UTC" "2016-12-25 22:35:00 UTC"
## [5] "2016-12-26 05:07:00 UTC"
## 
## 
## Potential problem Detected - very long.  Data to help investigate
## 
## Positions with problems are: 260 266 268 269 270 271 278 280 288 289 294 298 303 304 312 314 319 321 327 328 350 353 354
## Vector of Begins
##  [1] "2016-09-07 01:55:00 UTC" "2016-09-07 02:22:00 UTC"
##  [3] "2016-09-07 12:44:00 UTC" "2016-09-07 14:12:00 UTC"
##  [5] "2016-09-07 15:08:00 UTC" "2016-09-07 17:02:00 UTC"
##  [7] "2016-09-09 00:52:00 UTC" "2016-09-09 05:36:00 UTC"
##  [9] "2016-09-09 07:51:00 UTC" "2016-09-09 11:49:00 UTC"
## [11] "2016-09-09 15:33:00 UTC"
## 
## Vector of Ends
##  [1] "2016-09-07 02:57:00 UTC" "2016-09-07 13:19:00 UTC"
##  [3] "2016-09-07 14:57:00 UTC" "2016-09-07 15:44:00 UTC"
##  [5] "2016-09-07 17:11:00 UTC" "2016-09-09 01:01:00 UTC"
##  [7] "2016-09-09 05:53:00 UTC" "2016-09-09 09:31:00 UTC"
##  [9] "2016-09-09 12:39:00 UTC" "2016-09-09 15:57:00 UTC"
## [11] "2016-09-10 06:10:00 UTC"
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  5875 2515
##       TRUE      0  427
## 
## Mismatch at time 2016-09-06 08:53 UTC
## [1] "KMSP 060653Z 21010G17KT 10SM BKN050 BKN060 OVC090 27/22 A2983 RMK AO2 SLP097 T02720217"                                                                                             
## [2] "KMSP 060753Z 22007G17KT 6SM -TSRA FEW023 BKN049CB OVC095 26/22 A2982 RMK AO2 PK WND 22027/0742 LTG DSNT NE AND S SLP092 TSB35 OCNL LTGICCG OHD TS OHD MOV NE P0015 T02560222"       
## [3] "KMSP 060853Z 24003KT 8SM TS BKN049CB BKN060 OVC080 25/22 A2981 RMK AO2 LTG DSNT ALQDS SLP088 RAB0744E01B17 TSB0735E12B22 OCNL LTGICCG OHD TS OHD MOV NE P0001 60016 T02500222 58008"
## [4] "KMSP 060953Z 25007KT 4SM -TSRA BR SCT027CB BKN041 OVC080 24/22 A2981 RMK AO2 LTG DSNT ALQDS RAB34 TSE01B13E28B38 SLP089 FRQ LTGICCG OHD TS OHD MOV NE P0020 T02390222"              
## [5] "KMSP 061053Z 18003KT 5SM -TSRA BR FEW029 BKN045CB OVC065 24/22 A2981 RMK AO2 LTG DSNT ALQDS RAE01B14E28B46 TSE02B12E36B41 SLP088 OCNL LTGICCG N-E VCTS N-E MOV NE P0005 T02440222"  
## 
## Mismatch at time 2016-09-06 14:53 UTC
## [1] "KMSP 061253Z 00000KT 9SM -RA FEW007 SCT015 BKN100 OVC150 22/21 A2989 RMK AO2 LTG DSNT SE RAE12B37 SLP116 P0004 T02220206"     
## [2] "KMSP 061353Z 21007KT 8SM -RA FEW010 OVC090 22/21 A2996 RMK AO2 RAE00B21 SLP140 P0007 T02220206"                               
## [3] "KMSP 061453Z 08007KT 10SM FEW005 FEW110 BKN150 BKN250 22/21 A2989 RMK AO2 LTG DSNT W RAE47 SLP116 P0006 60017 T02220206 58006"
## [4] "KMSP 061553Z VRB05KT 10SM -RA SCT009 SCT100 BKN120 BKN150 22/21 A2992 RMK AO2 LTG DSNT SW RAB13 SLP125 P0003 T02220206"       
## [5] "KMSP 061653Z 15004KT 10SM FEW110 BKN150 BKN250 22/19 A2992 RMK AO2 LTG DSNT E RAE40 SLP127 P0002 T02220194"                   
## 
## Mismatch at time 2016-09-06 16:53 UTC
## [1] "KMSP 061453Z 08007KT 10SM FEW005 FEW110 BKN150 BKN250 22/21 A2989 RMK AO2 LTG DSNT W RAE47 SLP116 P0006 60017 T02220206 58006"       
## [2] "KMSP 061553Z VRB05KT 10SM -RA SCT009 SCT100 BKN120 BKN150 22/21 A2992 RMK AO2 LTG DSNT SW RAB13 SLP125 P0003 T02220206"              
## [3] "KMSP 061653Z 15004KT 10SM FEW110 BKN150 BKN250 22/19 A2992 RMK AO2 LTG DSNT E RAE40 SLP127 P0002 T02220194"                          
## [4] "KMSP 061753Z 20004KT 10SM FEW010 FEW150 BKN180 BKN250 22/19 A2994 RMK AO2 LTG DSNT SE AND S SLP132 60022 T02220194 10228 20217 51016"
## [5] "KMSP 061853Z 21004KT 10SM SCT011 BKN033 BKN150 BKN250 24/20 A2995 RMK AO2 SLP137 T02390200"                                          
## 
## Mismatch at time 2016-09-06 17:53 UTC
## [1] "KMSP 061553Z VRB05KT 10SM -RA SCT009 SCT100 BKN120 BKN150 22/21 A2992 RMK AO2 LTG DSNT SW RAB13 SLP125 P0003 T02220206"              
## [2] "KMSP 061653Z 15004KT 10SM FEW110 BKN150 BKN250 22/19 A2992 RMK AO2 LTG DSNT E RAE40 SLP127 P0002 T02220194"                          
## [3] "KMSP 061753Z 20004KT 10SM FEW010 FEW150 BKN180 BKN250 22/19 A2994 RMK AO2 LTG DSNT SE AND S SLP132 60022 T02220194 10228 20217 51016"
## [4] "KMSP 061853Z 21004KT 10SM SCT011 BKN033 BKN150 BKN250 24/20 A2995 RMK AO2 SLP137 T02390200"                                          
## [5] "KMSP 061953Z 23005KT 10SM FEW011 BKN036 BKN070 BKN250 24/20 A2996 RMK AO2 RAB20E49 SLP140 P0000 T02440200"                           
## 
## Mismatch at time 2016-09-06 18:53 UTC
## [1] "KMSP 061653Z 15004KT 10SM FEW110 BKN150 BKN250 22/19 A2992 RMK AO2 LTG DSNT E RAE40 SLP127 P0002 T02220194"                          
## [2] "KMSP 061753Z 20004KT 10SM FEW010 FEW150 BKN180 BKN250 22/19 A2994 RMK AO2 LTG DSNT SE AND S SLP132 60022 T02220194 10228 20217 51016"
## [3] "KMSP 061853Z 21004KT 10SM SCT011 BKN033 BKN150 BKN250 24/20 A2995 RMK AO2 SLP137 T02390200"                                          
## [4] "KMSP 061953Z 23005KT 10SM FEW011 BKN036 BKN070 BKN250 24/20 A2996 RMK AO2 RAB20E49 SLP140 P0000 T02440200"                           
## [5] "KMSP 062053Z 19003KT 10SM FEW018 BKN041 BKN130 BKN250 25/19 A2994 RMK AO2 LTG DSNT SE AND S SLP134 60000 T02500189 50002"
tmpMSP2$testAllBegins[lubridate::date(tmpMSP2$testAllBegins)=="2016-09-06"]
##  [1] "2016-09-06 07:44:00 UTC" "2016-09-06 07:53:00 UTC"
##  [3] "2016-09-06 09:34:00 UTC" "2016-09-06 10:14:00 UTC"
##  [5] "2016-09-06 10:46:00 UTC" "2016-09-06 11:16:00 UTC"
##  [7] "2016-09-06 12:37:00 UTC" "2016-09-06 13:21:00 UTC"
##  [9] "2016-09-06 15:13:00 UTC" "2016-09-06 19:20:00 UTC"
tmpMSP2$testAllEnds[lubridate::date(tmpMSP2$testAllEnds)=="2016-09-06"]
## [1] "2016-09-06 08:01:00 UTC" "2016-09-06 10:01:00 UTC"
## [3] "2016-09-06 10:28:00 UTC" "2016-09-06 11:01:00 UTC"
## [5] "2016-09-06 12:12:00 UTC" "2016-09-06 13:00:00 UTC"
## [7] "2016-09-06 14:47:00 UTC" "2016-09-06 16:40:00 UTC"
## [9] "2016-09-06 19:49:00 UTC"

The process correctly notes that the erroneous 0744 record should be deleted, yet the record remains in the data and causes the same mismatch errors.

The issue appears to be driven by getBeginEndTimeVector, which will be the next function to update.

A revisit to the precipitation extraction seems in order. Broadly, the extraction process is designed to get two things:

  • Precipitation state (yes/no by type) for each entry in the METAR
  • Precipitation ranges (start and end times for each interval of precipitation type)

Extracting the precipitation state is simple, and is based on ddddddZ…pType…RMK occurring somewhere in the METAR.

Extracting the data for constructing the ranges is based on RMK…pTypeBnnEnnBnn… type records. The range data must be converted to date-time, checked for consistency with the extracted precipitation states, and adjusted with a best-guess change to bring the two into alignment.
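As an illustration, the two regex tasks can be sketched against a single hypothetical METAR (the station, time, and remark values below are invented for the example):

```r
library(stringr)

# Hypothetical METAR with a rain begin/end group in the remarks (illustration only)
metar <- "KMSP 061053Z 18003KT 5SM -TSRA BR OVC065 24/22 A2981 RMK AO2 RAE01B14E28B46 SLP088"
pType <- "(?<!FZ)RA"

# Precipitation state: pType between the ddddddZ time group and RMK
str_detect(metar, paste0("\\d{6}Z.*", pType, ".*RMK"))
# [1] TRUE

# Range data: pType followed by B/E minute groups in the remarks
str_extract(metar, paste0("(", pType, "[B|E]\\d+[0-9BE]*)"))
# [1] "RAE01B14E28B46"
```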

This suggests three functions:

  • Function #1: Extract the precipitation state and precipitation range data for each entry
  • Function #2: Convert ranges data to date-time
  • Function #3: Correct for any inconsistencies in ranges date-time and METAR precipitation states

Example code includes:

# Function to extract precipitation state and range data from METAR
fnPrecip1 <- function(df, pType, showRegex=TRUE) {

    # The remarks pattern is created based on the precipitation type
    keyPattern <- paste0("(", pType, "[B|E]\\d+[0-9BE]*)")
    if (showRegex) { cat("\nRegex search code is:", keyPattern, "\n") }
    
    # Extract the current precipitation state, lagged precipitation state, and range data from METAR
    df <- df %>% 
        select(origMETAR, dtime) %>% 
        mutate(strPrecip=str_extract(origMETAR, pattern=keyPattern), 
               curPrecip=str_detect(origMETAR, pattern=paste0("\\d{6}Z.*", pType, ".*RMK")),
               lagPrecip=lag(curPrecip, 1)
               )
    
    # Confirm that df is unique by dtime (algorithm depends on this)
    dups <- df %>%
        select(dtime) %>%
        duplicated() %>%
        any()
    if (dups) {
        stop("\nThere are duplicate values of dtime - investigate and fix\n")
    }
    
    # Return the data frame
    df
    
}

msp_precip1 <- fnPrecip1(kmsp_2016, pType="(?<!FZ)RA")
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*)
msp_precip1
## # A tibble: 8,817 x 5
##    origMETAR                   dtime               strPrecip curPrecip lagPrecip
##    <chr>                       <dttm>              <chr>     <lgl>     <lgl>    
##  1 KMSP 310053Z 27008KT 4SM -~ 2015-12-31 00:53:00 <NA>      FALSE     NA       
##  2 KMSP 310153Z 23011KT 5SM -~ 2015-12-31 01:53:00 <NA>      FALSE     FALSE    
##  3 KMSP 310253Z 25007KT 9SM -~ 2015-12-31 02:53:00 <NA>      FALSE     FALSE    
##  4 KMSP 310353Z 27008KT 10SM ~ 2015-12-31 03:53:00 <NA>      FALSE     FALSE    
##  5 KMSP 310453Z 26007KT 10SM ~ 2015-12-31 04:53:00 <NA>      FALSE     FALSE    
##  6 KMSP 310553Z 23004KT 9SM F~ 2015-12-31 05:53:00 <NA>      FALSE     FALSE    
##  7 KMSP 310653Z 22003KT 8SM O~ 2015-12-31 06:53:00 <NA>      FALSE     FALSE    
##  8 KMSP 310753Z 23005KT 9SM O~ 2015-12-31 07:53:00 <NA>      FALSE     FALSE    
##  9 KMSP 310853Z 22006KT 9SM S~ 2015-12-31 08:53:00 <NA>      FALSE     FALSE    
## 10 KMSP 310953Z 23006KT 9SM C~ 2015-12-31 09:53:00 <NA>      FALSE     FALSE    
## # ... with 8,807 more rows

The next function splits out the B/E data in strPrecip and creates times for each:

# Helper function to zero-pad
zeroPad <- function(x, width=2, side="left", pad="0") {
    
    str_pad(x, width=width, side=side, pad=pad)
    
}

# Helper function to take a date-time and a minutes string and create a new date-time
helperBEDateTime <- function(dt, mins) {
    
    # Create the date as character (lubridate and dplyr do not always work well together)
    dateChar <- ifelse(str_length(mins)==4 & str_sub(mins, 1, 2)=="23", 
                       as.character(lubridate::as_date(lubridate::date(dt)-1)), 
                       as.character(lubridate::date(dt))
                       )
    
    # Create the hours and minutes as character
    hrMinChar <- ifelse(str_length(mins)==4, 
                        mins, 
                        paste0(zeroPad(lubridate::hour(dt)), mins)
                        )
    
    # Return the appropriate date-time
    lubridate::ymd_hm(paste0(dateChar, " ", hrMinChar))

}
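The hour-inheritance and midnight-rollover logic can be traced step by step; this is a self-contained sketch of what the helper computes, not a call to it:

```r
library(lubridate)
library(stringr)

dt <- ymd_hm("2016-09-06 00:53")

# Two-digit minutes inherit the hour of the METAR's dtime
ymd_hm(paste0(as.character(date(dt)), " ", str_pad(hour(dt), 2, pad="0"), "44"))
# [1] "2016-09-06 00:44:00 UTC"

# Four-digit times beginning with "23" roll back to the previous day
ymd_hm(paste0(as.character(date(dt) - 1), " ", "2355"))
# [1] "2016-09-05 23:55:00 UTC"
```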

# Function to create times for each piece of B/E data
fnPrecip2 <- function(df) {

    # Create the begin and end times by splitting precipString, then format as tibble and reattach dtime
    bEData <- df %>%
        pull(strPrecip) %>%
        str_extract_all(pattern="[BE]\\d+", simplify=TRUE) %>%
        as.data.frame(stringsAsFactors=FALSE) %>%
        tibble::as_tibble() %>%
        bind_cols(select(df, dtime))
    
    # Pivot longer, create a dummy record where is.na(V1), and split into time and state change
    bEData <- bEData %>%
        pivot_longer(-dtime) %>%
        mutate(value=ifelse(name=="V1" & is.na(value), paste0("N", lubridate::minute(dtime)), value)) %>%
        filter(value != "") %>%
        mutate(chgType=str_sub(value, 1, 1), 
               chgNum=str_sub(value, 2, -1), 
               chgTime=helperBEDateTime(dt=dtime, mins=chgNum)
               )
    
    # Return the data
    bEData
    
}

msp_precip2 <- fnPrecip2(msp_precip1)
msp_precip2
## # A tibble: 9,017 x 6
##    dtime               name  value chgType chgNum chgTime            
##    <dttm>              <chr> <chr> <chr>   <chr>  <dttm>             
##  1 2015-12-31 00:53:00 V1    N53   N       53     2015-12-31 00:53:00
##  2 2015-12-31 01:53:00 V1    N53   N       53     2015-12-31 01:53:00
##  3 2015-12-31 02:53:00 V1    N53   N       53     2015-12-31 02:53:00
##  4 2015-12-31 03:53:00 V1    N53   N       53     2015-12-31 03:53:00
##  5 2015-12-31 04:53:00 V1    N53   N       53     2015-12-31 04:53:00
##  6 2015-12-31 05:53:00 V1    N53   N       53     2015-12-31 05:53:00
##  7 2015-12-31 06:53:00 V1    N53   N       53     2015-12-31 06:53:00
##  8 2015-12-31 07:53:00 V1    N53   N       53     2015-12-31 07:53:00
##  9 2015-12-31 08:53:00 V1    N53   N       53     2015-12-31 08:53:00
## 10 2015-12-31 09:53:00 V1    N53   N       53     2015-12-31 09:53:00
## # ... with 9,007 more rows
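The splitting step in fnPrecip2 pulls each B/E minute group out of strPrecip; with simplify=TRUE, str_extract_all returns a character matrix with one column per group:

```r
stringr::str_extract_all("RAE01B14E28B46", pattern="[BE]\\d+", simplify=TRUE)
#      [,1]  [,2]  [,3]  [,4]
# [1,] "E01" "B14" "E28" "B46"
```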

Then a function can be written to extract issues where the times are out of sequence:

# Function to extract issues - continuity, distance from METAR
fnPrecip3 <- function(df) {

    issueCheck <- df %>%
        mutate(deltaMETAR=dtime-chgTime, 
               deltaPrev=ifelse(row_number()==1, 3600, chgTime-lag(chgTime, 1)), 
               issueCons=(deltaMETAR < 0 | deltaMETAR > 3599 | deltaPrev <= 0)
               )
    
    cat("\nContinuity or consistency error - record(s) will be deleted\n")
    issueCheck %>%
        filter(issueCons) %>%
        print()
    
    issueCheck %>%
        select(-deltaMETAR, -deltaPrev) %>%
        filter(!issueCons)
    
}

msp_precip3 <- fnPrecip3(msp_precip2)
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 1 x 9
##   dtime               name  value chgType chgNum chgTime             deltaMETAR
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <drtn>    
## 1 2016-09-06 08:53:00 V1    B0744 B       0744   2016-09-06 07:44:00 4140 secs 
## # ... with 2 more variables: deltaPrev <dbl>, issueCons <lgl>
msp_precip3
## # A tibble: 9,016 x 7
##    dtime               name  value chgType chgNum chgTime             issueCons
##    <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>    
##  1 2015-12-31 00:53:00 V1    N53   N       53     2015-12-31 00:53:00 FALSE    
##  2 2015-12-31 01:53:00 V1    N53   N       53     2015-12-31 01:53:00 FALSE    
##  3 2015-12-31 02:53:00 V1    N53   N       53     2015-12-31 02:53:00 FALSE    
##  4 2015-12-31 03:53:00 V1    N53   N       53     2015-12-31 03:53:00 FALSE    
##  5 2015-12-31 04:53:00 V1    N53   N       53     2015-12-31 04:53:00 FALSE    
##  6 2015-12-31 05:53:00 V1    N53   N       53     2015-12-31 05:53:00 FALSE    
##  7 2015-12-31 06:53:00 V1    N53   N       53     2015-12-31 06:53:00 FALSE    
##  8 2015-12-31 07:53:00 V1    N53   N       53     2015-12-31 07:53:00 FALSE    
##  9 2015-12-31 08:53:00 V1    N53   N       53     2015-12-31 08:53:00 FALSE    
## 10 2015-12-31 09:53:00 V1    N53   N       53     2015-12-31 09:53:00 FALSE    
## # ... with 9,006 more rows
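The continuity window behind issueCons can be seen directly on the flagged record: a change time must fall within the hour ending at dtime, and the B0744 begin time sits 69 minutes before its 08:53 METAR:

```r
# The flagged record: B0744 attached to the 08:53 METAR
dtime   <- lubridate::ymd_hm("2016-09-06 08:53")
chgTime <- lubridate::ymd_hm("2016-09-06 07:44")

# deltaMETAR in seconds; anything outside [0, 3599] is flagged
as.numeric(dtime - chgTime, units="secs")
# [1] 4140
```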

Then a function can be written to find issues where there are the wrong number, or wrong ordering, of begin and end times:

# Function to find wrong sequence of times
fnPrecip4 <- function(dfTimes, dfOrig) {
    
    # Pivot dfTimes back to one record per dtime, discarding any consistency issues
    # OK if not all dtimes are included; missing ones come from dfOrig
    dfCheck <- dfTimes %>%
        filter(!issueCons) %>%
        mutate(newValue=case_when(chgType=="N" ~ 0, chgType=="B" ~ 1, chgType=="E" ~ -1)) %>%
        pivot_wider(dtime, names_from="name", values_from="newValue") %>%
        right_join(select(dfOrig, dtime, curPrecip, lagPrecip), by="dtime") %>%
        mutate_if(is.numeric, ~ifelse(is.na(.), 0, .))
    
    # Create the cumsum of state change
    cs <- dfCheck %>%
        select(-dtime, -curPrecip) %>%
        mutate(lagPrecip=as.integer(lagPrecip)) %>%
        select(lagPrecip, everything()) %>%
        apply(1, FUN=cumsum) %>%
        t()
    
    # Create the list of issues
    # If cumsum is ever anything other than 1 or 0 there is a problem
    # If the last value of cumsum does not equal curPrecip there is a problem
    issue01 <- which(apply(cs, 1, FUN=max) > 1 | apply(cs, 1, FUN=min) < 0)
    issuelc <- which(dfCheck$curPrecip != cs[, ncol(cs)])
    issueAll <- sort(unique(c(issue01, issuelc)))
    
    cat("\n\nIssues with begin when precipitation or end when no precipitation:", length(issue01))
    cat("\nIssues where state change does not add to the total:", length(issuelc))
    cat("\nTotal issues (may be less than sum due to same datetime in both):", length(issueAll), "\n\n")
    # print(dfCheck[issueAll, ] %>% select(dtime, lagPrecip, everything()))
    
    dfCheck[sort(unique(c(issue01, issuelc))), "dtime"]
    
}

# Get the list of bad date-times
msp_precip4 <- fnPrecip4(msp_precip3, msp_precip1)
## 
## 
## Issues with begin when precipitation or end when no precipitation: 2
## Issues where state change does not add to the total: 8
## Total issues (may be less than sum due to same datetime in both): 8
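The cumsum check at the heart of fnPrecip4 can be illustrated with a toy sequence, starting from the lagged state and adding +1 for each begin and -1 for each end:

```r
# Toy sequence: no precipitation entering the hour, then B, E, B
lagPrecip <- 0
changes   <- c(1, -1, 1)             # B, E, B
cs <- cumsum(c(lagPrecip, changes))
cs
# [1] 0 1 0 1

# Internally consistent: never below 0 (end with no rain) or above 1 (begin while raining)
max(cs) > 1 | min(cs) < 0
# [1] FALSE

# The final state must also match curPrecip for the hour (here: raining)
tail(cs, 1) == 1
# [1] TRUE
```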

Then a function can work through only the bad date-times, leaving the good date-times “as is”:

# Function to work through each bad date-time and suggest new ranges
fnPrecip5 <- function(df, issueTimes, lagCurFrame) {
    
    # Create a dataframe for issueTimes
    issues <- issueTimes %>%
        mutate(issue=TRUE)
    
    # Split df into issues and non-issues
    df <- df %>%
        left_join(issues, by="dtime") %>%
        mutate(issue=ifelse(is.na(issue), FALSE, issue), 
               lastRecord=(row_number()==n())
               ) %>%
        left_join(select(lagCurFrame, dtime, lagPrecip, curPrecip), by="dtime")
    
    dfIssues <- df %>%
        filter(issue)
    dfNoIssues <- df %>%
        filter(!issue)
    
    # Note the records to be worked through
    cat("\nRecords to be addressed include:\n")
    dfIssues %>%
        select(-issueCons, -issue) %>%
        print()

    # Work through a record by starting with lagPrecip
    # If first record inconsistent with lagPrecip, flag for deletion and keep state; update state otherwise
    # If the next record is inconsistent with the previous state, flag for deletion and keep state; update otherwise
    # If final record inconsistent with curPrecip, add a record using dtime as the time
    
    # Get the unique times with issues
    dtimeVec <- dfIssues %>% pull(dtime) %>% unique()
    
    # Create empty vectors for deletes and adds
    delVec <- as.POSIXct(character(0))
    addBegin <- as.POSIXct(character(0))
    addEnd <- as.POSIXct(character(0))
    
    # Populate the delete and add vectors
    for (dt in dtimeVec) {
        
        # Pull the records for this time
        dtRecords <- dfIssues %>% filter(dtime==lubridate::as_datetime(dt))
        
        # Initialize the previous state to lagPrecip
        preState <- dtRecords$lagPrecip[1]
        
        # Loop through and flag state change errors
        for (ctr in 1:nrow(dtRecords)) {
            
            # Grow the deletions vector
            if ((!preState & dtRecords$chgType[ctr]=="E") | (preState & dtRecords$chgType[ctr]=="B")) {
                delVec <- c(delVec, dtRecords$chgTime[ctr])
                # do not modify the state, it has not changed due to the deletion
            } else {
                # do not grow the deletion vector, the state change is OK here if chgType is not "N"
                if (dtRecords$chgType[ctr] != "N") { preState <- !preState }
            }
            
            # Create a single addition if needed
            if (ctr==nrow(dtRecords) & preState != dtRecords$curPrecip[ctr]) {
                if (dtRecords$curPrecip[ctr]) { addBegin <- c(addBegin, dt) }
                if (!dtRecords$curPrecip[ctr]) { addEnd <- c(addEnd, dt) }
            }
        }
        
    }
    
    # If there is precipitation at the very end of the file, add an end time at dtime+1
    fixEnd <- dfNoIssues %>%
        filter(lastRecord & curPrecip) %>%
        mutate(timeUse=dtime+1) %>%
        pull(timeUse)
    if (length(fixEnd) > 0) {
        addEnd <- c(addEnd, fixEnd)
        cat("\nAdding final end time to cap interval at end of file:", as.character(fixEnd), "\n")
    }
    
    # Print the key vectors
    cat("\nStart/end times deleted:\n")
    print(lubridate::as_datetime(delVec))
    cat("\nBegin times added\n")
    print(lubridate::as_datetime(addBegin))
    cat("\nEnd times added:\n")
    print(lubridate::as_datetime(addEnd))
    
    # Create the full list of issue start times
    beginIssues <- dfIssues %>%
        filter(chgType=="B") %>%
        pull(chgTime)
    endIssues <- dfIssues %>%
        filter(chgType=="E") %>%
        pull(chgTime)
    
    beginIssues <- c(beginIssues[!beginIssues %in% delVec], addBegin)
    endIssues <- c(endIssues[!endIssues %in% delVec], addEnd)

    # cat("\nNew begin times list from issues:\n")
    # print(beginIssues)
    # cat("\nNew end times list from issues:\n")
    # print(endIssues)
    
    # Create the full list of start and end times
    beginOK <- dfNoIssues %>%
        filter(chgType=="B") %>%
        pull(chgTime)
    endOK <- dfNoIssues %>%
        filter(chgType=="E") %>%
        pull(chgTime)
    
    # Return a list of beginTimes and endTimes
    list(beginTimes=sort(c(beginOK, beginIssues)), 
         endTimes=sort(c(endOK, endIssues))
         )
    
}


msp_precip5 <- fnPrecip5(msp_precip3, issueTimes=msp_precip4, lagCurFrame=msp_precip1)
## 
## Records to be addressed include:
## # A tibble: 9 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-04-05 15:53:00 V1    E22   E       22     2016-04-05 15:22:00 FALSE     
## 2 2016-05-29 19:53:00 V1    B31   B       31     2016-05-29 19:31:00 FALSE     
## 3 2016-06-20 04:53:00 V1    B11   B       11     2016-06-20 04:11:00 FALSE     
## 4 2016-09-06 07:53:00 V1    N53   N       53     2016-09-06 07:53:00 FALSE     
## 5 2016-09-06 08:53:00 V2    E01   E       01     2016-09-06 08:01:00 FALSE     
## 6 2016-09-06 08:53:00 V3    B17   B       17     2016-09-06 08:17:00 FALSE     
## 7 2016-10-18 04:53:00 V1    E35   E       35     2016-10-18 04:35:00 FALSE     
## 8 2016-11-22 13:53:00 V1    N53   N       53     2016-11-22 13:53:00 FALSE     
## 9 2016-11-29 07:53:00 V1    B18   B       18     2016-11-29 07:18:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## [1] "2016-10-18 04:35:00 UTC" "2016-11-29 07:18:00 UTC"
## 
## Begin times added
## [1] "2016-04-05 15:53:00 UTC" "2016-09-06 07:53:00 UTC"
## 
## End times added:
## [1] "2016-05-29 19:53:00 UTC" "2016-06-20 04:53:00 UTC"
## [3] "2016-09-06 08:53:00 UTC" "2016-11-22 13:53:00 UTC"

The intervals can then be checked for consistency:

fnPrecip6 <- function(lst, df){

    # Extract the begin and end times and compute durations
    begins <- lst[["beginTimes"]]
    ends <- lst[["endTimes"]]
    durs <- ends - begins

    # Create intervals from the raw list file
    precipInts <- interval(begins, ends - 1) # make the interval stop the minute before the end time
    
    # Extract the METAR and date-time information and check for overlaps
    dtime <- df %>% pull(dtime)
    metar <- df %>% pull(origMETAR)
    precipMETAR <- df %>% pull(curPrecip)
    intMETAR <- sapply(dtime, FUN=function(x) {x %within% precipInts %>% any()})

    # Check for the consistency of the observations and print the mismatches
    print(table(precipMETAR, intMETAR))

    mism <- which(precipMETAR != intMETAR)
    if (length(mism) == 0) {
        cat("\nFull matches between METAR observations and intervals\n")
    } else {
        for (x in mism) {
            cat("\nMismatch at time", strftime(dtime[x], format="%Y-%m-%d %H:%M", tz="UTC"), "UTC\n")
            print(metar[max(1, x-2):min(length(metar), x+2)])
        }
    }
    
    list(beginTimes=lst$beginTimes, endTimes=lst$endTimes, 
         precipInts=precipInts, mismatches=mism, mismatchTimes=dtime[mism]
         )
    
}

msp_precip6 <- fnPrecip6(msp_precip5, msp_precip1)
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8390    0
##       TRUE      0  427
## 
## Full matches between METAR observations and intervals
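The interval test inside fnPrecip6 relies on lubridate's %within% operator; a minimal sketch with two invented intervals:

```r
library(lubridate)

# Two hypothetical rain intervals (ends already trimmed back one second)
ints <- interval(ymd_hm(c("2016-09-06 07:53", "2016-09-06 15:13")),
                 ymd_hm(c("2016-09-06 08:52", "2016-09-06 16:39")))

# An observation counts as "in precipitation" if it falls within any interval
x <- ymd_hm("2016-09-06 08:00")
x %within% ints
# [1]  TRUE FALSE
any(x %within% ints)
# [1] TRUE
```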

The algorithm correctly fixes the Minneapolis issues, though it is still rather clunky. Next steps are to make the algorithm more general and run it against all locales.

A final function is built to combine the precipitation functions:

# Function to combine precipitation functions
combinePrecipFunctions <- function(df, 
                                   pType="(?<!FZ)RA", 
                                   errorLengthInconsistency=TRUE
                                   ) {
    
    # FUNCTION ARGUMENTS:
    # df: data frame or tibble containing processed METAR data, including dtime and origMETAR
    # pType: regex code for the precipitation type of interest
    # errorLengthInconsistency: boolean, whether to error out if beginTimes, endTimes, or precipInts have different lengths or imply intervals of non-positive duration
    
    # Extract the current precipitation state, lagged precipitation state, and string of begin/end times
    precip1 <- fnPrecip1(df, pType=pType)
    
    # Find and delete continuity (next before previous) and consistency (belongs to wrong METAR) errors
    precip3 <- fnPrecip2(precip1) %>%
        fnPrecip3()
    
    # Find issues with precipitation begin and end times compared to each other and precipitation states
    precip4 <- fnPrecip4(precip3, precip1)
    
    # Correct for the issues with begin and end times and create a vector of begin times and end times
    precip5 <- fnPrecip5(precip3, issueTimes=precip4, lagCurFrame=precip1)
    
    # Check intervals for consistency with precipitation states
    precip6 <- fnPrecip6(precip5, precip1)
    
    # Extract beginTimes, endTimes, and precipInts
    beginTimes <- precip6$beginTimes
    endTimes <- precip6$endTimes
    precipInts <- precip6$precipInts
    
    # Check that the intervals are all positive and that there are the same number of begin and end times
    lenBT <- length(beginTimes)
    lenET <- length(endTimes)
    lenPI <- length(precipInts)
    piNotPositive <- (lubridate::time_length(precipInts) <= 0)
    
    # Check for same lengths    
    if (max(lenBT, lenET, lenPI) != min(lenBT, lenET, lenPI)) {
        cat("Inconsistent lengths - beginTimes:", lenBT, ", endTimes:", lenET, "precipInts:", lenPI, "\n")
        if (errorLengthInconsistency) { 
            stop("Exiting due to inconsistent lengths") 
        } else {
            cat("\nContinuing to process, output frame will append NA where needed\n")
            maxLen <- max(lenBT, lenET, lenPI)
            beginTimes <- c(beginTimes, rep(NA, maxLen-lenBT))
            endTimes <- c(endTimes, rep(NA, maxLen-lenET))
            precipInts <- c(precipInts, rep(NA, maxLen-lenPI))
        }
    }
    
    # Check for positive intervals
    if (sum(piNotPositive) > 0) {
        cat("\nPrecipitation intervals are non-positive:", 
            sum(piNotPositive), 
            "at positions",
            which(piNotPositive), 
            "\n"
            )
        if (errorLengthInconsistency) { stop("Exiting due to inconsistency in interval lengths") } 
    }
    
    
    # Return the relevant information
    list(pStateFrame=precip1, 
         pIssueTimes=precip4,
         mismatches=precip6$mismatches,
         mismatchTimes=precip6$mismatchTimes,
         beginEndInterval=tibble::tibble(beginTimes=beginTimes, endTimes=endTimes, precipInts=precipInts)
         )
    
}

The function can then be run for the Minneapolis 2016 rain data:

kmsp_2016_RA <- combinePrecipFunctions(kmsp_2016, pType="(?<!FZ)RA", errorLengthInconsistency=TRUE)
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 1 x 9
##   dtime               name  value chgType chgNum chgTime             deltaMETAR
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <drtn>    
## 1 2016-09-06 08:53:00 V1    B0744 B       0744   2016-09-06 07:44:00 4140 secs 
## # ... with 2 more variables: deltaPrev <dbl>, issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 2
## Issues where state change does not add to the total: 8
## Total issues (may be less than sum due to same datetime in both): 8 
## 
## 
## Records to be addressed include:
## # A tibble: 9 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-04-05 15:53:00 V1    E22   E       22     2016-04-05 15:22:00 FALSE     
## 2 2016-05-29 19:53:00 V1    B31   B       31     2016-05-29 19:31:00 FALSE     
## 3 2016-06-20 04:53:00 V1    B11   B       11     2016-06-20 04:11:00 FALSE     
## 4 2016-09-06 07:53:00 V1    N53   N       53     2016-09-06 07:53:00 FALSE     
## 5 2016-09-06 08:53:00 V2    E01   E       01     2016-09-06 08:01:00 FALSE     
## 6 2016-09-06 08:53:00 V3    B17   B       17     2016-09-06 08:17:00 FALSE     
## 7 2016-10-18 04:53:00 V1    E35   E       35     2016-10-18 04:35:00 FALSE     
## 8 2016-11-22 13:53:00 V1    N53   N       53     2016-11-22 13:53:00 FALSE     
## 9 2016-11-29 07:53:00 V1    B18   B       18     2016-11-29 07:18:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## [1] "2016-10-18 04:35:00 UTC" "2016-11-29 07:18:00 UTC"
## 
## Begin times added
## [1] "2016-04-05 15:53:00 UTC" "2016-09-06 07:53:00 UTC"
## 
## End times added:
## [1] "2016-05-29 19:53:00 UTC" "2016-06-20 04:53:00 UTC"
## [3] "2016-09-06 08:53:00 UTC" "2016-11-22 13:53:00 UTC"
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8390    0
##       TRUE      0  427
## 
## Full matches between METAR observations and intervals

The function can then be run for rain across all of the 2016 data:

fileNames <- grep(pattern="2016", names(cityNameMapper), value=TRUE)
rainList <- vector("list", length(fileNames))
names(rainList) <- paste0(fileNames, "_RA")

for (city in fileNames) {
    
    cat("\n\n*** Processing for:", cityNameMapper[city], "\n")
    # Run the rain extraction for city and store in rainList
    rainList[[paste0(city, "_RA")]] <- combinePrecipFunctions(get(city), 
                                                              pType="(?<!FZ)RA", 
                                                              errorLengthInconsistency=TRUE
                                                              )
    
}
## 
## 
## *** Processing for: Detroit, MI (2016) 
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 3
## Issues where state change does not add to the total: 7
## Total issues (may be less than sum due to same datetime in both): 7 
## 
## 
## Records to be addressed include:
## # A tibble: 9 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-02-24 11:53:00 V1    N53   N       53     2016-02-24 11:53:00 FALSE     
## 2 2016-04-01 00:53:00 V1    B01   B       01     2016-04-01 00:01:00 FALSE     
## 3 2016-04-01 00:53:00 V2    E12   E       12     2016-04-01 00:12:00 FALSE     
## 4 2016-04-07 22:53:00 V1    N53   N       53     2016-04-07 22:53:00 FALSE     
## 5 2016-04-07 23:53:00 V1    N53   N       53     2016-04-07 23:53:00 FALSE     
## 6 2016-07-14 00:53:00 V1    E30   E       30     2016-07-14 00:30:00 FALSE     
## 7 2016-08-15 00:53:00 V1    B2355 B       2355   2016-08-14 23:55:00 FALSE     
## 8 2016-08-15 00:53:00 V2    E29   E       29     2016-08-15 00:29:00 FALSE     
## 9 2016-08-16 14:53:00 V1    N53   N       53     2016-08-16 14:53:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## [1] "2016-04-01 00:01:00 UTC" "2016-07-14 00:30:00 UTC"
## [3] "2016-08-14 23:55:00 UTC"
## 
## Begin times added
## [1] "2016-02-24 11:53:00 UTC" "2016-04-07 23:53:00 UTC"
## 
## End times added:
## [1] "2016-04-07 22:53:00 UTC" "2016-08-16 14:53:00 UTC"
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8258    0
##       TRUE      0  560
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Newark, NJ (2016) 
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 2
## Issues where state change does not add to the total: 2
## Total issues (may be less than sum due to same datetime in both): 2 
## 
## 
## Records to be addressed include:
## # A tibble: 2 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-02-24 19:51:00 V1    B1858 B       1858   2016-02-24 18:58:00 FALSE     
## 2 2016-11-20 08:51:00 V1    E0758 E       0758   2016-11-20 07:58:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## [1] "2016-02-24 18:58:00 UTC" "2016-11-20 07:58:00 UTC"
## 
## Begin times added
## POSIXct of length 0
## 
## End times added:
## POSIXct of length 0
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8195    0
##       TRUE      0  626
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Green Bay, WI (2016) 
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 0
## Issues where state change does not add to the total: 1
## Total issues (may be less than sum due to same datetime in both): 1 
## 
## 
## Records to be addressed include:
## # A tibble: 2 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-05-02 17:53:00 V1    E1659 E       1659   2016-05-02 16:59:00 FALSE     
## 2 2016-05-02 17:53:00 V2    B39   B       39     2016-05-02 17:39:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## POSIXct of length 0
## 
## Begin times added
## POSIXct of length 0
## 
## End times added:
## [1] "2016-05-02 17:53:00 UTC"
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8378    0
##       TRUE      0  425
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Grand Rapids, MI (2016) 
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 2
## Issues where state change does not add to the total: 5
## Total issues (may be less than sum due to same datetime in both): 5 
## 
## 
## Records to be addressed include:
## # A tibble: 5 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-04-07 12:53:00 V1    E15   E       15     2016-04-07 12:15:00 FALSE     
## 2 2016-06-26 13:53:00 V1    B46   B       46     2016-06-26 13:46:00 FALSE     
## 3 2016-06-26 14:53:00 V1    N53   N       53     2016-06-26 14:53:00 FALSE     
## 4 2016-07-24 15:53:00 V1    N53   N       53     2016-07-24 15:53:00 FALSE     
## 5 2016-08-12 09:53:00 V1    B0854 B       0854   2016-08-12 08:54:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## [1] "2016-04-07 12:15:00 UTC" "2016-08-12 08:54:00 UTC"
## 
## Begin times added
## [1] "2016-06-26 14:53:00 UTC"
## 
## End times added:
## [1] "2016-06-26 13:53:00 UTC" "2016-07-24 15:53:00 UTC"
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8208    0
##       TRUE      0  604
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Houston, TX (2016) 
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 1 x 9
##   dtime               name  value chgType chgNum chgTime             deltaMETAR
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <drtn>    
## 1 2016-03-23 22:53:00 V1    B2145 B       2145   2016-03-23 21:45:00 4080 secs 
## # ... with 2 more variables: deltaPrev <dbl>, issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 3
## Issues where state change does not add to the total: 7
## Total issues (may be less than sum due to same datetime in both): 7 
## 
## 
## Records to be addressed include:
## # A tibble: 9 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-03-23 22:53:00 V2    E16   E       16     2016-03-23 22:16:00 FALSE     
## 2 2016-06-06 00:53:00 V1    E40   E       40     2016-06-06 00:40:00 FALSE     
## 3 2016-08-14 00:53:00 V1    N53   N       53     2016-08-14 00:53:00 FALSE     
## 4 2016-08-14 22:53:00 V1    N53   N       53     2016-08-14 22:53:00 FALSE     
## 5 2016-08-21 19:53:00 V1    B1857 B       1857   2016-08-21 18:57:00 FALSE     
## 6 2016-08-21 19:53:00 V2    E27   E       27     2016-08-21 19:27:00 FALSE     
## 7 2016-08-21 19:53:00 V3    B45   B       45     2016-08-21 19:45:00 FALSE     
## 8 2016-12-03 18:53:00 V1    E12   E       12     2016-12-03 18:12:00 FALSE     
## 9 2016-12-04 10:53:00 V1    N53   N       53     2016-12-04 10:53:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## [1] "2016-03-23 22:16:00 UTC" "2016-06-06 00:40:00 UTC"
## [3] "2016-12-03 18:12:00 UTC"
## 
## Begin times added
## [1] "2016-08-14 00:53:00 UTC" "2016-08-14 22:53:00 UTC"
## 
## End times added:
## [1] "2016-08-21 19:53:00 UTC" "2016-12-04 10:53:00 UTC"
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8368    0
##       TRUE      0  448
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Indianapolis, IN (2016) 
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 1
## Issues where state change does not add to the total: 5
## Total issues (may be less than sum due to same datetime in both): 5 
## 
## 
## Records to be addressed include:
## # A tibble: 6 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-04-07 23:54:00 V1    N54   N       54     2016-04-07 23:54:00 FALSE     
## 2 2016-05-08 19:54:00 V1    N54   N       54     2016-05-08 19:54:00 FALSE     
## 3 2016-07-18 17:54:00 V1    N54   N       54     2016-07-18 17:54:00 FALSE     
## 4 2016-08-07 19:54:00 V1    N54   N       54     2016-08-07 19:54:00 FALSE     
## 5 2016-09-05 21:54:00 V1    B01   B       01     2016-09-05 21:01:00 FALSE     
## 6 2016-09-05 21:54:00 V2    E38   E       38     2016-09-05 21:38:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## [1] "2016-09-05 21:01:00 UTC"
## 
## Begin times added
## [1] "2016-07-18 17:54:00 UTC"
## 
## End times added:
## [1] "2016-04-07 23:54:00 UTC" "2016-05-08 19:54:00 UTC"
## [3] "2016-08-07 19:54:00 UTC"
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8144    0
##       TRUE      0  623
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Las Vegas, NV (2016) 
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 0
## Issues where state change does not add to the total: 1
## Total issues (may be less than sum due to same datetime in both): 1 
## 
## 
## Records to be addressed include:
## # A tibble: 1 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-04-30 23:56:00 V1    N56   N       56     2016-04-30 23:56:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## POSIXct of length 0
## 
## Begin times added
## [1] "2016-04-30 23:56:00 UTC"
## 
## End times added:
## POSIXct of length 0
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8710    0
##       TRUE      0  108
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Lincoln, NE (2016) 
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 0
## Issues where state change does not add to the total: 3
## Total issues (may be less than sum due to same datetime in both): 3 
## 
## 
## Records to be addressed include:
## # A tibble: 3 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-07-13 18:54:00 V1    N54   N       54     2016-07-13 18:54:00 FALSE     
## 2 2016-08-29 22:54:00 V1    N54   N       54     2016-08-29 22:54:00 FALSE     
## 3 2016-09-16 06:54:00 V1    N54   N       54     2016-09-16 06:54:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## POSIXct of length 0
## 
## Begin times added
## [1] "2016-08-29 22:54:00 UTC" "2016-09-16 06:54:00 UTC"
## 
## End times added:
## [1] "2016-07-13 18:54:00 UTC"
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8500    0
##       TRUE      0  313
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Milwaukee, WI (2016) 
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 0
## Issues where state change does not add to the total: 4
## Total issues (may be less than sum due to same datetime in both): 4 
## 
## 
## Records to be addressed include:
## # A tibble: 4 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-04-10 10:52:00 V1    N52   N       52     2016-04-10 10:52:00 FALSE     
## 2 2016-06-10 03:52:00 V1    N52   N       52     2016-06-10 03:52:00 FALSE     
## 3 2016-08-20 13:52:00 V1    N52   N       52     2016-08-20 13:52:00 FALSE     
## 4 2016-10-07 02:52:00 V1    N52   N       52     2016-10-07 02:52:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## POSIXct of length 0
## 
## Begin times added
## [1] "2016-04-10 10:52:00 UTC" "2016-06-10 03:52:00 UTC"
## [3] "2016-08-20 13:52:00 UTC" "2016-10-07 02:52:00 UTC"
## 
## End times added:
## POSIXct of length 0
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8346    0
##       TRUE      0  462
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Madison, WI (2016) 
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 1
## Issues where state change does not add to the total: 2
## Total issues (may be less than sum due to same datetime in both): 2 
## 
## 
## Records to be addressed include:
## # A tibble: 2 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-09-07 12:53:00 V1    E52   E       52     2016-09-07 12:52:00 FALSE     
## 2 2016-09-16 11:53:00 V1    N53   N       53     2016-09-16 11:53:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## [1] "2016-09-07 12:52:00 UTC"
## 
## Begin times added
## POSIXct of length 0
## 
## End times added:
## [1] "2016-09-16 11:53:00 UTC"
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8347    0
##       TRUE      0  451
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Minneapolis, MN (2016) 
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 1 x 9
##   dtime               name  value chgType chgNum chgTime             deltaMETAR
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <drtn>    
## 1 2016-09-06 08:53:00 V1    B0744 B       0744   2016-09-06 07:44:00 4140 secs 
## # ... with 2 more variables: deltaPrev <dbl>, issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 2
## Issues where state change does not add to the total: 8
## Total issues (may be less than sum due to same datetime in both): 8 
## 
## 
## Records to be addressed include:
## # A tibble: 9 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-04-05 15:53:00 V1    E22   E       22     2016-04-05 15:22:00 FALSE     
## 2 2016-05-29 19:53:00 V1    B31   B       31     2016-05-29 19:31:00 FALSE     
## 3 2016-06-20 04:53:00 V1    B11   B       11     2016-06-20 04:11:00 FALSE     
## 4 2016-09-06 07:53:00 V1    N53   N       53     2016-09-06 07:53:00 FALSE     
## 5 2016-09-06 08:53:00 V2    E01   E       01     2016-09-06 08:01:00 FALSE     
## 6 2016-09-06 08:53:00 V3    B17   B       17     2016-09-06 08:17:00 FALSE     
## 7 2016-10-18 04:53:00 V1    E35   E       35     2016-10-18 04:35:00 FALSE     
## 8 2016-11-22 13:53:00 V1    N53   N       53     2016-11-22 13:53:00 FALSE     
## 9 2016-11-29 07:53:00 V1    B18   B       18     2016-11-29 07:18:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## [1] "2016-10-18 04:35:00 UTC" "2016-11-29 07:18:00 UTC"
## 
## Begin times added
## [1] "2016-04-05 15:53:00 UTC" "2016-09-06 07:53:00 UTC"
## 
## End times added:
## [1] "2016-05-29 19:53:00 UTC" "2016-06-20 04:53:00 UTC"
## [3] "2016-09-06 08:53:00 UTC" "2016-11-22 13:53:00 UTC"
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8390    0
##       TRUE      0  427
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: New Orleans, LA (2016) 
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 1 x 9
##   dtime               name  value chgType chgNum chgTime             deltaMETAR
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <drtn>    
## 1 2017-01-01 09:53:00 V1    E0852 E       0852   2017-01-01 08:52:00 3660 secs 
## # ... with 2 more variables: deltaPrev <dbl>, issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 2
## Issues where state change does not add to the total: 4
## Total issues (may be less than sum due to same datetime in both): 4 
## 
## 
## Records to be addressed include:
## # A tibble: 5 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-03-11 20:53:00 V1    B07   B       07     2016-03-11 20:07:00 FALSE     
## 2 2016-03-27 16:53:00 V1    N53   N       53     2016-03-27 16:53:00 FALSE     
## 3 2016-08-14 23:53:00 V1    B39   B       39     2016-08-14 23:39:00 FALSE     
## 4 2016-12-13 23:53:00 V1    E2258 E       2258   2016-12-13 22:58:00 FALSE     
## 5 2016-12-13 23:53:00 V2    B30   B       30     2016-12-13 23:30:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Adding final end time to cap interval at end of file: 2017-01-01 23:53:01 
## 
## Start/end times deleted:
## [1] "2016-03-11 20:07:00 UTC" "2016-12-13 22:58:00 UTC"
## 
## Begin times added
## POSIXct of length 0
## 
## End times added:
## [1] "2016-03-27 16:53:00 UTC" "2016-08-14 23:53:00 UTC"
## [3] "2017-01-01 23:53:01 UTC"
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8375    0
##       TRUE      0  438
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Chicago, IL (2016) 
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 0
## Issues where state change does not add to the total: 3
## Total issues (may be less than sum due to same datetime in both): 3 
## 
## 
## Records to be addressed include:
## # A tibble: 3 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-03-24 21:51:00 V1    N51   N       51     2016-03-24 21:51:00 FALSE     
## 2 2016-07-13 18:51:00 V1    N51   N       51     2016-07-13 18:51:00 FALSE     
## 3 2016-08-29 19:51:00 V1    N51   N       51     2016-08-29 19:51:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## POSIXct of length 0
## 
## Begin times added
## [1] "2016-03-24 21:51:00 UTC" "2016-08-29 19:51:00 UTC"
## 
## End times added:
## [1] "2016-07-13 18:51:00 UTC"
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8349    0
##       TRUE      0  466
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: San Diego, CA (2016) 
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 0
## Issues where state change does not add to the total: 0
## Total issues (may be less than sum due to same datetime in both): 0 
## 
## 
## Records to be addressed include:
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, lastRecord <lgl>, lagPrecip <lgl>,
## #   curPrecip <lgl>
## 
## Start/end times deleted:
## POSIXct of length 0
## 
## Begin times added
## POSIXct of length 0
## 
## End times added:
## POSIXct of length 0
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8612    0
##       TRUE      0  198
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Traverse City, MI (2016) 
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 3
## Issues where state change does not add to the total: 5
## Total issues (may be less than sum due to same datetime in both): 5 
## 
## 
## Records to be addressed include:
## # A tibble: 7 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-03-28 10:53:00 V1    E0957 E       0957   2016-03-28 09:57:00 FALSE     
## 2 2016-06-06 20:53:00 V1    B39   B       39     2016-06-06 20:39:00 FALSE     
## 3 2016-06-06 20:53:00 V2    E49   E       49     2016-06-06 20:49:00 FALSE     
## 4 2016-06-06 21:53:00 V1    B17   B       17     2016-06-06 21:17:00 FALSE     
## 5 2016-06-06 21:53:00 V2    E38   E       38     2016-06-06 21:38:00 FALSE     
## 6 2016-09-28 19:53:00 V1    N53   N       53     2016-09-28 19:53:00 FALSE     
## 7 2016-12-02 16:53:00 V1    E29   E       29     2016-12-02 16:29:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## [1] "2016-03-28 09:57:00 UTC" "2016-06-06 21:17:00 UTC"
## [3] "2016-12-02 16:29:00 UTC"
## 
## Begin times added
## [1] "2016-06-06 20:53:00 UTC"
## 
## End times added:
## [1] "2016-09-28 19:53:00 UTC"
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8345    0
##       TRUE      0  469
## 
## Full matches between METAR observations and intervals

The rain data parsed successfully, with all intervals of the proper length.
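The core of the search is the Perl-compatible negative lookbehind shown in the logged pattern. A minimal sketch in base R, run on made-up METAR remark fragments (not taken from the data files), shows why plain rain begin/end groups match while freezing rain does not. Note the sketch uses `[BE]` where the logged pattern has `[B|E]`; inside a character class the `|` is a literal pipe, so the two are equivalent on valid METAR text:

```r
# Hypothetical remark fragments, labeled for illustration only
frags <- c("RAB15E40",     # rain began :15, ended :40 -> should match
           "FZRAB22",      # freezing rain -> lookbehind blocks the match
           "SNB05 RAE30")  # snow begin plus a rain end -> matches "RAE30"

# perl = TRUE enables the (?<!FZ) negative lookbehind in base R regex
hits <- regmatches(frags,
                   gregexpr("(?<!FZ)RA[BE]\\d+[0-9BE]*", frags, perl = TRUE))

hits
# [[1]] "RAB15E40"   [[2]] character(0)   [[3]] "RAE30"
```

The same shape applies to the snow search, with `(?<!BL)SN` excluding blowing snow (BLSN) instead.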

The function can also be run for snow across all of the 2016 data:

fileNames <- grep(pattern="2016", names(cityNameMapper), value=TRUE)
snowList <- vector("list", length(fileNames))
names(snowList) <- paste0(fileNames, "_SN")

for (city in fileNames) {
    
    cat("\n\n*** Processing for:", cityNameMapper[city], "\n")
    # Run the snow extraction for city and store in snowList; the negative
    # lookbehind (?<!BL) keeps blowing snow (BLSN) out of the SN matches
    snowList[[paste0(city, "_SN")]] <- combinePrecipFunctions(get(city), 
                                                              pType="(?<!BL)SN", 
                                                              errorLengthInconsistency=TRUE
                                                              )
    
}
## 
## 
## *** Processing for: Detroit, MI (2016) 
## 
## Regex search code is: ((?<!BL)SN[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 0
## Issues where state change does not add to the total: 2
## Total issues (may be less than sum due to same datetime in both): 2 
## 
## 
## Records to be addressed include:
## # A tibble: 2 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-02-09 10:53:00 V1    N53   N       53     2016-02-09 10:53:00 FALSE     
## 2 2016-02-24 11:53:00 V1    B12   B       12     2016-02-24 11:12:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## POSIXct of length 0
## 
## Begin times added
## [1] "2016-02-09 10:53:00 UTC"
## 
## End times added:
## [1] "2016-02-24 11:53:00 UTC"
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8344    0
##       TRUE      0  474
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Newark, NJ (2016) 
## 
## Regex search code is: ((?<!BL)SN[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 1
## Issues where state change does not add to the total: 3
## Total issues (may be less than sum due to same datetime in both): 3 
## 
## 
## Records to be addressed include:
## # A tibble: 3 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-02-10 14:51:00 V1    N51   N       51     2016-02-10 14:51:00 FALSE     
## 2 2016-11-20 07:51:00 V1    N51   N       51     2016-11-20 07:51:00 FALSE     
## 3 2016-11-20 08:51:00 V1    E0756 E       0756   2016-11-20 07:56:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## [1] "2016-11-20 07:56:00 UTC"
## 
## Begin times added
## POSIXct of length 0
## 
## End times added:
## [1] "2016-02-10 14:51:00 UTC" "2016-11-20 07:51:00 UTC"
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8683    0
##       TRUE      0  138
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Green Bay, WI (2016) 
## 
## Regex search code is: ((?<!BL)SN[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 1
## Issues where state change does not add to the total: 2
## Total issues (may be less than sum due to same datetime in both): 2 
## 
## 
## Records to be addressed include:
## # A tibble: 3 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-12-07 15:53:00 V1    E09   E       09     2016-12-07 15:09:00 FALSE     
## 2 2016-12-07 15:53:00 V2    B38   B       38     2016-12-07 15:38:00 FALSE     
## 3 2016-12-29 17:53:00 V1    N53   N       53     2016-12-29 17:53:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## [1] "2016-12-07 15:09:00 UTC"
## 
## Begin times added
## POSIXct of length 0
## 
## End times added:
## [1] "2016-12-29 17:53:00 UTC"
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8460    0
##       TRUE      0  343
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Grand Rapids, MI (2016) 
## 
## Regex search code is: ((?<!BL)SN[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 0
## Issues where state change does not add to the total: 3
## Total issues (may be less than sum due to same datetime in both): 3 
## 
## 
## Records to be addressed include:
## # A tibble: 3 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-02-04 20:53:00 V1    N53   N       53     2016-02-04 20:53:00 FALSE     
## 2 2016-04-02 16:53:00 V1    N53   N       53     2016-04-02 16:53:00 FALSE     
## 3 2016-04-02 17:53:00 V1    N53   N       53     2016-04-02 17:53:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## POSIXct of length 0
## 
## Begin times added
## [1] "2016-02-04 20:53:00 UTC" "2016-04-02 17:53:00 UTC"
## 
## End times added:
## [1] "2016-04-02 16:53:00 UTC"
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8137    0
##       TRUE      0  675
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Houston, TX (2016) 
## 
## Regex search code is: ((?<!BL)SN[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 0
## Issues where state change does not add to the total: 0
## Total issues (may be less than sum due to same datetime in both): 0 
## 
## 
## Records to be addressed include:
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, lastRecord <lgl>, lagPrecip <lgl>,
## #   curPrecip <lgl>
## 
## Start/end times deleted:
## POSIXct of length 0
## 
## Begin times added
## POSIXct of length 0
## 
## End times added:
## POSIXct of length 0
##            intMETAR
## precipMETAR FALSE
##       FALSE  8816
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Indianapolis, IN (2016) 
## 
## Regex search code is: ((?<!BL)SN[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 0
## Issues where state change does not add to the total: 0
## Total issues (may be less than sum due to same datetime in both): 0 
## 
## 
## Records to be addressed include:
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, lastRecord <lgl>, lagPrecip <lgl>,
## #   curPrecip <lgl>
## 
## Start/end times deleted:
## POSIXct of length 0
## 
## Begin times added
## POSIXct of length 0
## 
## End times added:
## POSIXct of length 0
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8529    0
##       TRUE      0  238
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Las Vegas, NV (2016) 
## 
## Regex search code is: ((?<!BL)SN[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 0
## Issues where state change does not add to the total: 0
## Total issues (may be less than sum due to same datetime in both): 0 
## 
## 
## Records to be addressed include:
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, lastRecord <lgl>, lagPrecip <lgl>,
## #   curPrecip <lgl>
## 
## Start/end times deleted:
## POSIXct of length 0
## 
## Begin times added
## POSIXct of length 0
## 
## End times added:
## POSIXct of length 0
##            intMETAR
## precipMETAR FALSE
##       FALSE  8818
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Lincoln, NE (2016) 
## 
## Regex search code is: ((?<!BL)SN[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 1
## Issues where state change does not add to the total: 1
## Total issues (may be less than sum due to same datetime in both): 1 
## 
## 
## Records to be addressed include:
## # A tibble: 1 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-01-19 12:54:00 V1    B1155 B       1155   2016-01-19 11:55:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## [1] "2016-01-19 11:55:00 UTC"
## 
## Begin times added
## POSIXct of length 0
## 
## End times added:
## POSIXct of length 0
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8719    0
##       TRUE      0   94
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Milwaukee, WI (2016) 
## 
## Regex search code is: ((?<!BL)SN[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 1
## Issues where state change does not add to the total: 1
## Total issues (may be less than sum due to same datetime in both): 1 
## 
## 
## Records to be addressed include:
## # A tibble: 1 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-01-26 01:52:00 V1    E39   E       39     2016-01-26 01:39:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## [1] "2016-01-26 01:39:00 UTC"
## 
## Begin times added
## POSIXct of length 0
## 
## End times added:
## POSIXct of length 0
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8462    0
##       TRUE      0  346
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Madison, WI (2016) 
## 
## Regex search code is: ((?<!BL)SN[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 3
## Issues where state change does not add to the total: 6
## Total issues (may be less than sum due to same datetime in both): 6 
## 
## 
## Records to be addressed include:
## # A tibble: 6 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-01-09 14:53:00 V1    E39   E       39     2016-01-09 14:39:00 FALSE     
## 2 2016-01-12 12:53:00 V1    N53   N       53     2016-01-12 12:53:00 FALSE     
## 3 2016-02-09 04:53:00 V1    N53   N       53     2016-02-09 04:53:00 FALSE     
## 4 2016-02-09 05:53:00 V1    N53   N       53     2016-02-09 05:53:00 FALSE     
## 5 2016-04-08 14:53:00 V1    E36   E       36     2016-04-08 14:36:00 FALSE     
## 6 2016-12-12 05:53:00 V1    E01   E       01     2016-12-12 05:01:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## [1] "2016-01-09 14:39:00 UTC" "2016-04-08 14:36:00 UTC"
## [3] "2016-12-12 05:01:00 UTC"
## 
## Begin times added
## [1] "2016-02-09 05:53:00 UTC"
## 
## End times added:
## [1] "2016-01-12 12:53:00 UTC" "2016-02-09 04:53:00 UTC"
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8435    0
##       TRUE      0  363
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Minneapolis, MN (2016) 
## 
## Regex search code is: ((?<!BL)SN[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 0
## Issues where state change does not add to the total: 4
## Total issues (may be less than sum due to same datetime in both): 4 
## 
## 
## Records to be addressed include:
## # A tibble: 4 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-01-12 00:53:00 V1    N53   N       53     2016-01-12 00:53:00 FALSE     
## 2 2016-12-07 14:53:00 V1    N53   N       53     2016-12-07 14:53:00 FALSE     
## 3 2016-12-07 15:53:00 V1    N53   N       53     2016-12-07 15:53:00 FALSE     
## 4 2016-12-08 22:53:00 V1    N53   N       53     2016-12-08 22:53:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## POSIXct of length 0
## 
## Begin times added
## [1] "2016-01-12 00:53:00 UTC" "2016-12-07 15:53:00 UTC"
## [3] "2016-12-08 22:53:00 UTC"
## 
## End times added:
## [1] "2016-12-07 14:53:00 UTC"
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8438    0
##       TRUE      0  379
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: New Orleans, LA (2016) 
## 
## Regex search code is: ((?<!BL)SN[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 0
## Issues where state change does not add to the total: 0
## Total issues (may be less than sum due to same datetime in both): 0 
## 
## 
## Records to be addressed include:
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, lastRecord <lgl>, lagPrecip <lgl>,
## #   curPrecip <lgl>
## 
## Start/end times deleted:
## POSIXct of length 0
## 
## Begin times added
## POSIXct of length 0
## 
## End times added:
## POSIXct of length 0
##            intMETAR
## precipMETAR FALSE
##       FALSE  8813
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Chicago, IL (2016) 
## 
## Regex search code is: ((?<!BL)SN[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 1
## Issues where state change does not add to the total: 5
## Total issues (may be less than sum due to same datetime in both): 5 
## 
## 
## Records to be addressed include:
## # A tibble: 5 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-02-10 00:51:00 V1    N51   N       51     2016-02-10 00:51:00 FALSE     
## 2 2016-03-24 21:51:00 V1    N51   N       51     2016-03-24 21:51:00 FALSE     
## 3 2016-04-08 22:51:00 V1    N51   N       51     2016-04-08 22:51:00 FALSE     
## 4 2016-04-09 04:51:00 V1    E46   E       46     2016-04-09 04:46:00 FALSE     
## 5 2016-12-24 01:51:00 V1    N51   N       51     2016-12-24 01:51:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## [1] "2016-04-09 04:46:00 UTC"
## 
## Begin times added
## [1] "2016-02-10 00:51:00 UTC" "2016-03-24 21:51:00 UTC"
## [3] "2016-04-08 22:51:00 UTC" "2016-12-24 01:51:00 UTC"
## 
## End times added:
## POSIXct of length 0
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8532    0
##       TRUE      0  283
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: San Diego, CA (2016) 
## 
## Regex search code is: ((?<!BL)SN[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 0
## Issues where state change does not add to the total: 0
## Total issues (may be less than sum due to same datetime in both): 0 
## 
## 
## Records to be addressed include:
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, lastRecord <lgl>, lagPrecip <lgl>,
## #   curPrecip <lgl>
## 
## Start/end times deleted:
## POSIXct of length 0
## 
## Begin times added
## POSIXct of length 0
## 
## End times added:
## POSIXct of length 0
##            intMETAR
## precipMETAR FALSE
##       FALSE  8810
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Traverse City, MI (2016) 
## 
## Regex search code is: ((?<!BL)SN[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 1
## Issues where state change does not add to the total: 2
## Total issues (may be less than sum due to same datetime in both): 2 
## 
## 
## Records to be addressed include:
## # A tibble: 2 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-12-02 16:53:00 V1    E40   E       40     2016-12-02 16:40:00 FALSE     
## 2 2016-12-18 06:53:00 V1    N53   N       53     2016-12-18 06:53:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## [1] "2016-12-02 16:40:00 UTC"
## 
## Begin times added
## POSIXct of length 0
## 
## End times added:
## [1] "2016-12-18 06:53:00 UTC"
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  7828    0
##       TRUE      0  986
## 
## Full matches between METAR observations and intervals

The snow data parsed successfully, with all intervals of the proper length.
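The snow extraction hinges on the negative lookbehind shown in the log (`(?<!BL)SN[B|E]\d+[0-9BE]*`). As a minimal, self-contained sketch (not the module's own code), the pattern can be exercised against two hypothetical METAR remark strings to confirm that plain snow begin/end groups match while blowing snow (`BLSN`) does not:

```r
# Pattern as logged above. Note that inside the character class [B|E] the "|"
# is a literal character, not alternation; it is harmless here but the class
# would also match a literal "|".
pat <- "(?<!BL)SN[B|E]\\d+[0-9BE]*"

remarks <- c("RMK AO2 SNB05E30 P0000",  # snow began :05, ended :30 -> match
             "RMK AO2 BLSNB05")         # blowing snow -> lookbehind rejects

# perl = TRUE enables PCRE, which supports the (?<!BL) lookbehind.
regmatches(remarks, regexpr(pat, remarks, perl = TRUE))
# returns "SNB05E30"
```

The same skeleton applies to the thunderstorm run below, where `(?<!VC)TS` excludes `VCTS` (thunderstorm in the vicinity) from matching.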

The function can also be run for thunderstorms across all of the 2016 data:

fileNames <- grep(pattern="2016", names(cityNameMapper), value=TRUE)
thunderList <- vector("list", length(fileNames))
names(thunderList) <- paste0(fileNames, "_TS")

for (city in fileNames) {
    
    cat("\n\n*** Processing for:", cityNameMapper[city], "\n")
    # Run the thunder extraction for city and store in thunderList
    thunderList[[paste0(city, "_TS")]] <- combinePrecipFunctions(get(city), 
                                                                 pType="(?<!VC)TS", 
                                                                 errorLengthInconsistency=TRUE
                                                                 )
    
}
## 
## 
## *** Processing for: Detroit, MI (2016) 
## 
## Regex search code is: ((?<!VC)TS[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 1
## Issues where state change does not add to the total: 1
## Total issues (may be less than sum due to same datetime in both): 1 
## 
## 
## Records to be addressed include:
## # A tibble: 1 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-07-14 00:53:00 V1    E30   E       30     2016-07-14 00:30:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## [1] "2016-07-14 00:30:00 UTC"
## 
## Begin times added
## POSIXct of length 0
## 
## End times added:
## POSIXct of length 0
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8784    0
##       TRUE      0   34
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Newark, NJ (2016) 
## 
## Regex search code is: ((?<!VC)TS[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 0
## Issues where state change does not add to the total: 1
## Total issues (may be less than sum due to same datetime in both): 1 
## 
## 
## Records to be addressed include:
## # A tibble: 1 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-02-24 19:51:00 V1    N51   N       51     2016-02-24 19:51:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## POSIXct of length 0
## 
## Begin times added
## POSIXct of length 0
## 
## End times added:
## [1] "2016-02-24 19:51:00 UTC"
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8785    0
##       TRUE      0   36
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Green Bay, WI (2016) 
## 
## Regex search code is: ((?<!VC)TS[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 0
## Issues where state change does not add to the total: 0
## Total issues (may be less than sum due to same datetime in both): 0 
## 
## 
## Records to be addressed include:
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, lastRecord <lgl>, lagPrecip <lgl>,
## #   curPrecip <lgl>
## 
## Start/end times deleted:
## POSIXct of length 0
## 
## Begin times added
## POSIXct of length 0
## 
## End times added:
## POSIXct of length 0
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8782    0
##       TRUE      0   21
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Grand Rapids, MI (2016) 
## 
## Regex search code is: ((?<!VC)TS[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 2
## Issues where state change does not add to the total: 2
## Total issues (may be less than sum due to same datetime in both): 2 
## 
## 
## Records to be addressed include:
## # A tibble: 2 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-05-12 08:53:00 V1    E47   E       47     2016-05-12 08:47:00 FALSE     
## 2 2016-06-15 07:53:00 V1    E18   E       18     2016-06-15 07:18:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## [1] "2016-05-12 08:47:00 UTC" "2016-06-15 07:18:00 UTC"
## 
## Begin times added
## POSIXct of length 0
## 
## End times added:
## POSIXct of length 0
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8758    0
##       TRUE      0   54
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Houston, TX (2016) 
## 
## Regex search code is: ((?<!VC)TS[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 2
## Issues where state change does not add to the total: 4
## Total issues (may be less than sum due to same datetime in both): 4 
## 
## 
## Records to be addressed include:
## # A tibble: 5 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-08-14 00:53:00 V1    N53   N       53     2016-08-14 00:53:00 FALSE     
## 2 2016-08-14 22:53:00 V1    N53   N       53     2016-08-14 22:53:00 FALSE     
## 3 2016-09-20 22:53:00 V1    E05   E       05     2016-09-20 22:05:00 FALSE     
## 4 2016-09-24 19:53:00 V1    E11   E       11     2016-09-24 19:11:00 FALSE     
## 5 2016-09-24 19:53:00 V2    B29   B       29     2016-09-24 19:29:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## [1] "2016-09-20 22:05:00 UTC" "2016-09-24 19:11:00 UTC"
## 
## Begin times added
## [1] "2016-08-14 00:53:00 UTC" "2016-08-14 22:53:00 UTC"
## 
## End times added:
## POSIXct of length 0
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8644    0
##       TRUE      0  172
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Indianapolis, IN (2016) 
## 
## Regex search code is: ((?<!VC)TS[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 1 x 9
##   dtime               name  value chgType chgNum chgTime             deltaMETAR
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <drtn>    
## 1 2016-07-18 18:54:00 V2    E21   E       21     2016-07-18 18:21:00 1980 secs 
## # ... with 2 more variables: deltaPrev <dbl>, issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 14
## Issues where state change does not add to the total: 16
## Total issues (may be less than sum due to same datetime in both): 16 
## 
## 
## Records to be addressed include:
## # A tibble: 31 x 9
##    dtime               name  value chgType chgNum chgTime             lastRecord
##    <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
##  1 2016-04-28 10:54:00 V1    B37   B       37     2016-04-28 10:37:00 FALSE     
##  2 2016-06-27 02:54:00 V1    N54   N       54     2016-06-27 02:54:00 FALSE     
##  3 2016-07-13 20:54:00 V1    E40   E       40     2016-07-13 20:40:00 FALSE     
##  4 2016-07-18 18:54:00 V1    B38   B       38     2016-07-18 18:38:00 FALSE     
##  5 2016-09-08 19:54:00 V1    E27   E       27     2016-09-08 19:27:00 FALSE     
##  6 2016-09-08 19:54:00 V2    B42   B       42     2016-09-08 19:42:00 FALSE     
##  7 2016-09-10 06:54:00 V1    E30   E       30     2016-09-10 06:30:00 FALSE     
##  8 2016-09-10 06:54:00 V2    E31   E       31     2016-09-10 06:31:00 FALSE     
##  9 2016-09-10 06:54:00 V3    B37   B       37     2016-09-10 06:37:00 FALSE     
## 10 2016-09-16 19:54:00 V1    E13   E       13     2016-09-16 19:13:00 FALSE     
## # ... with 21 more rows, and 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
##  [1] "2016-04-28 10:37:00 UTC" "2016-07-13 20:40:00 UTC"
##  [3] "2016-07-18 18:38:00 UTC" "2016-09-08 19:27:00 UTC"
##  [5] "2016-09-10 06:31:00 UTC" "2016-09-16 19:13:00 UTC"
##  [7] "2016-09-26 02:04:00 UTC" "2016-09-26 04:42:00 UTC"
##  [9] "2016-09-26 06:02:00 UTC" "2016-09-28 15:01:00 UTC"
## [11] "2016-09-28 20:10:00 UTC" "2016-09-30 23:06:00 UTC"
## [13] "2016-10-20 01:04:00 UTC" "2016-10-20 03:32:00 UTC"
## 
## Begin times added
## [1] "2016-06-27 02:54:00 UTC"
## 
## End times added:
## [1] "2016-07-18 18:54:00 UTC" "2016-10-20 06:54:00 UTC"
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8671    0
##       TRUE      0   96
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Las Vegas, NV (2016) 
## 
## Regex search code is: ((?<!VC)TS[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 5
## Issues where state change does not add to the total: 7
## Total issues (may be less than sum due to same datetime in both): 7 
## 
## 
## Records to be addressed include:
## # A tibble: 9 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-04-29 00:56:00 V1    E09   E       09     2016-04-29 00:09:00 FALSE     
## 2 2016-04-29 00:56:00 V2    B29   B       29     2016-04-29 00:29:00 FALSE     
## 3 2016-04-30 23:56:00 V1    N56   N       56     2016-04-30 23:56:00 FALSE     
## 4 2016-06-28 22:56:00 V1    E51   E       51     2016-06-28 22:51:00 FALSE     
## 5 2016-08-04 04:56:00 V1    N56   N       56     2016-08-04 04:56:00 FALSE     
## 6 2016-08-04 14:56:00 V1    E03   E       03     2016-08-04 14:03:00 FALSE     
## 7 2016-08-27 03:56:00 V1    E53   E       53     2016-08-27 03:53:00 FALSE     
## 8 2016-10-24 20:56:00 V1    E19   E       19     2016-10-24 20:19:00 FALSE     
## 9 2016-10-24 20:56:00 V2    B36   B       36     2016-10-24 20:36:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## [1] "2016-04-29 00:09:00 UTC" "2016-06-28 22:51:00 UTC"
## [3] "2016-08-04 14:03:00 UTC" "2016-08-27 03:53:00 UTC"
## [5] "2016-10-24 20:19:00 UTC"
## 
## Begin times added
## [1] "2016-04-30 23:56:00 UTC" "2016-08-04 04:56:00 UTC"
## 
## End times added:
## POSIXct of length 0
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8783    0
##       TRUE      0   35
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Lincoln, NE (2016) 
## 
## Regex search code is: ((?<!VC)TS[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 3
## Issues where state change does not add to the total: 4
## Total issues (may be less than sum due to same datetime in both): 4 
## 
## 
## Records to be addressed include:
## # A tibble: 5 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-04-24 10:54:00 V1    N54   N       54     2016-04-24 10:54:00 FALSE     
## 2 2016-05-23 14:54:00 V1    E31   E       31     2016-05-23 14:31:00 FALSE     
## 3 2016-05-23 14:54:00 V2    B44   B       44     2016-05-23 14:44:00 FALSE     
## 4 2016-06-18 08:54:00 V1    B44   B       44     2016-06-18 08:44:00 FALSE     
## 5 2016-09-16 06:54:00 V1    E0558 E       0558   2016-09-16 05:58:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## [1] "2016-05-23 14:31:00 UTC" "2016-06-18 08:44:00 UTC"
## [3] "2016-09-16 05:58:00 UTC"
## 
## Begin times added
## POSIXct of length 0
## 
## End times added:
## [1] "2016-04-24 10:54:00 UTC"
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8758    0
##       TRUE      0   55
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Milwaukee, WI (2016) 
## 
## Regex search code is: ((?<!VC)TS[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 2
## Issues where state change does not add to the total: 8
## Total issues (may be less than sum due to same datetime in both): 8 
## 
## 
## Records to be addressed include:
## # A tibble: 8 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-03-31 17:52:00 V1    N52   N       52     2016-03-31 17:52:00 FALSE     
## 2 2016-04-06 08:52:00 V1    E10   E       10     2016-04-06 08:10:00 FALSE     
## 3 2016-06-10 03:52:00 V1    N52   N       52     2016-06-10 03:52:00 FALSE     
## 4 2016-06-26 06:52:00 V1    N52   N       52     2016-06-26 06:52:00 FALSE     
## 5 2016-06-26 11:52:00 V1    N52   N       52     2016-06-26 11:52:00 FALSE     
## 6 2016-07-01 04:52:00 V1    E07   E       07     2016-07-01 04:07:00 FALSE     
## 7 2016-07-06 11:52:00 V1    N52   N       52     2016-07-06 11:52:00 FALSE     
## 8 2016-07-17 13:52:00 V1    N52   N       52     2016-07-17 13:52:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## [1] "2016-04-06 08:10:00 UTC" "2016-07-01 04:07:00 UTC"
## 
## Begin times added
## [1] "2016-06-10 03:52:00 UTC" "2016-06-26 06:52:00 UTC"
## [3] "2016-06-26 11:52:00 UTC"
## 
## End times added:
## [1] "2016-03-31 17:52:00 UTC" "2016-07-06 11:52:00 UTC"
## [3] "2016-07-17 13:52:00 UTC"
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8757    0
##       TRUE      0   51
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Madison, WI (2016) 
## 
## Regex search code is: ((?<!VC)TS[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 7
## Issues where state change does not add to the total: 9
## Total issues (may be less than sum due to same datetime in both): 9 
## 
## 
## Records to be addressed include:
## # A tibble: 15 x 9
##    dtime               name  value chgType chgNum chgTime             lastRecord
##    <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
##  1 2016-03-16 07:53:00 V1    N53   N       53     2016-03-16 07:53:00 FALSE     
##  2 2016-05-27 10:53:00 V1    E14   E       14     2016-05-27 10:14:00 FALSE     
##  3 2016-05-27 10:53:00 V2    B32   B       32     2016-05-27 10:32:00 FALSE     
##  4 2016-07-11 04:53:00 V1    E00   E       00     2016-07-11 04:00:00 FALSE     
##  5 2016-09-07 12:53:00 V1    E14   E       14     2016-09-07 12:14:00 FALSE     
##  6 2016-09-13 08:53:00 V1    E22   E       22     2016-09-13 08:22:00 FALSE     
##  7 2016-09-13 08:53:00 V2    B38   B       38     2016-09-13 08:38:00 FALSE     
##  8 2016-09-13 08:53:00 V3    E53   E       53     2016-09-13 08:53:00 FALSE     
##  9 2016-09-16 11:53:00 V1    E20   E       20     2016-09-16 11:20:00 FALSE     
## 10 2016-09-19 21:53:00 V1    E45   E       45     2016-09-19 21:45:00 FALSE     
## 11 2016-09-20 00:53:00 V1    B07   B       07     2016-09-20 00:07:00 FALSE     
## 12 2016-10-16 04:53:00 V1    B0358 B       0358   2016-10-16 03:58:00 FALSE     
## 13 2016-10-16 04:53:00 V2    E01   E       01     2016-10-16 04:01:00 FALSE     
## 14 2016-10-16 04:53:00 V3    E03   E       03     2016-10-16 04:03:00 FALSE     
## 15 2016-10-16 04:53:00 V4    B15   B       15     2016-10-16 04:15:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## [1] "2016-05-27 10:14:00 UTC" "2016-07-11 04:00:00 UTC"
## [3] "2016-09-07 12:14:00 UTC" "2016-09-13 08:22:00 UTC"
## [5] "2016-09-16 11:20:00 UTC" "2016-09-19 21:45:00 UTC"
## [7] "2016-10-16 04:03:00 UTC"
## 
## Begin times added
## POSIXct of length 0
## 
## End times added:
## [1] "2016-03-16 07:53:00 UTC" "2016-09-20 00:53:00 UTC"
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8735    0
##       TRUE      0   63
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Minneapolis, MN (2016) 
## 
## Regex search code is: ((?<!VC)TS[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 1 x 9
##   dtime               name  value chgType chgNum chgTime             deltaMETAR
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <drtn>    
## 1 2016-09-06 08:53:00 V1    B0735 B       0735   2016-09-06 07:35:00 4680 secs 
## # ... with 2 more variables: deltaPrev <dbl>, issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 6
## Issues where state change does not add to the total: 6
## Total issues (may be less than sum due to same datetime in both): 6 
## 
## 
## Records to be addressed include:
## # A tibble: 14 x 9
##    dtime               name  value chgType chgNum chgTime             lastRecord
##    <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
##  1 2016-07-10 10:53:00 V1    E20   E       20     2016-07-10 10:20:00 FALSE     
##  2 2016-07-10 10:53:00 V2    B26   B       26     2016-07-10 10:26:00 FALSE     
##  3 2016-07-23 15:53:00 V1    E28   E       28     2016-07-23 15:28:00 FALSE     
##  4 2016-07-23 15:53:00 V2    B34   B       34     2016-07-23 15:34:00 FALSE     
##  5 2016-08-11 04:53:00 V1    E0359 E       0359   2016-08-11 03:59:00 FALSE     
##  6 2016-08-11 04:53:00 V2    B02   B       02     2016-08-11 04:02:00 FALSE     
##  7 2016-08-11 04:53:00 V3    E50   E       50     2016-08-11 04:50:00 FALSE     
##  8 2016-08-16 21:53:00 V1    E09   E       09     2016-08-16 21:09:00 FALSE     
##  9 2016-08-16 21:53:00 V2    B21   B       21     2016-08-16 21:21:00 FALSE     
## 10 2016-09-22 02:53:00 V1    E48   E       48     2016-09-22 02:48:00 FALSE     
## 11 2016-09-22 02:53:00 V2    E49   E       49     2016-09-22 02:49:00 FALSE     
## 12 2016-10-18 04:53:00 V1    E03   E       03     2016-10-18 04:03:00 FALSE     
## 13 2016-10-18 04:53:00 V2    B05   B       05     2016-10-18 04:05:00 FALSE     
## 14 2016-10-18 04:53:00 V3    E20   E       20     2016-10-18 04:20:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## [1] "2016-07-10 10:20:00 UTC" "2016-07-23 15:28:00 UTC"
## [3] "2016-08-11 03:59:00 UTC" "2016-08-16 21:09:00 UTC"
## [5] "2016-09-22 02:49:00 UTC" "2016-10-18 04:03:00 UTC"
## 
## Begin times added
## POSIXct of length 0
## 
## End times added:
## POSIXct of length 0
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8763    0
##       TRUE      0   54
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: New Orleans, LA (2016) 
## 
## Regex search code is: ((?<!VC)TS[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 14
## Issues where state change does not add to the total: 18
## Total issues (may be less than sum due to same datetime in both): 18 
## 
## 
## Records to be addressed include:
## # A tibble: 27 x 9
##    dtime               name  value chgType chgNum chgTime             lastRecord
##    <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
##  1 2016-03-11 18:53:00 V1    N53   N       53     2016-03-11 18:53:00 FALSE     
##  2 2016-03-11 19:53:00 V1    N53   N       53     2016-03-11 19:53:00 FALSE     
##  3 2016-04-13 11:53:00 V1    E14   E       14     2016-04-13 11:14:00 FALSE     
##  4 2016-04-13 11:53:00 V2    B37   B       37     2016-04-13 11:37:00 FALSE     
##  5 2016-04-13 11:53:00 V3    B50   B       50     2016-04-13 11:50:00 FALSE     
##  6 2016-06-03 22:53:00 V1    B40   B       40     2016-06-03 22:40:00 FALSE     
##  7 2016-06-28 21:53:00 V1    B2056 B       2056   2016-06-28 20:56:00 FALSE     
##  8 2016-06-28 21:53:00 V2    E22   E       22     2016-06-28 21:22:00 FALSE     
##  9 2016-07-13 18:53:00 V1    N53   N       53     2016-07-13 18:53:00 FALSE     
## 10 2016-09-02 02:53:00 V1    E31   E       31     2016-09-02 02:31:00 FALSE     
## # ... with 17 more rows, and 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
##  [1] "2016-04-13 11:50:00 UTC" "2016-06-03 22:40:00 UTC"
##  [3] "2016-06-28 20:56:00 UTC" "2016-09-02 02:31:00 UTC"
##  [5] "2016-09-03 18:58:00 UTC" "2016-09-04 14:05:00 UTC"
##  [7] "2016-09-04 19:50:00 UTC" "2016-09-05 14:03:00 UTC"
##  [9] "2016-09-06 20:44:00 UTC" "2016-09-10 19:31:00 UTC"
## [11] "2016-09-15 20:19:00 UTC" "2016-09-27 00:45:00 UTC"
## [13] "2016-12-06 00:57:00 UTC" "2016-12-06 02:02:00 UTC"
## 
## Begin times added
## [1] "2016-03-11 18:53:00 UTC" "2016-07-13 18:53:00 UTC"
## 
## End times added:
## [1] "2016-03-11 19:53:00 UTC" "2016-09-11 21:53:00 UTC"
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8606    0
##       TRUE      0  207
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Chicago, IL (2016) 
## 
## Regex search code is: ((?<!VC)TS[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 4
## Issues where state change does not add to the total: 6
## Total issues (may be less than sum due to same datetime in both): 6 
## 
## 
## Records to be addressed include:
## # A tibble: 7 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-08-28 08:51:00 V1    B25   B       25     2016-08-28 08:25:00 FALSE     
## 2 2016-08-28 09:51:00 V1    E10   E       10     2016-08-28 09:10:00 FALSE     
## 3 2016-09-21 17:51:00 V1    B44   B       44     2016-09-21 17:44:00 FALSE     
## 4 2016-09-21 18:51:00 V1    E1759 E       1759   2016-09-21 17:59:00 FALSE     
## 5 2016-10-06 21:51:00 V1    E31   E       31     2016-10-06 21:31:00 FALSE     
## 6 2016-10-12 17:51:00 V1    E33   E       33     2016-10-12 17:33:00 FALSE     
## 7 2016-10-12 17:51:00 V2    B36   B       36     2016-10-12 17:36:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## [1] "2016-08-28 09:10:00 UTC" "2016-09-21 17:59:00 UTC"
## [3] "2016-10-06 21:31:00 UTC" "2016-10-12 17:33:00 UTC"
## 
## Begin times added
## POSIXct of length 0
## 
## End times added:
## [1] "2016-08-28 08:51:00 UTC" "2016-09-21 17:51:00 UTC"
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8749    0
##       TRUE      0   66
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: San Diego, CA (2016) 
## 
## Regex search code is: ((?<!VC)TS[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 0
## Issues where state change does not add to the total: 0
## Total issues (may be less than sum due to same datetime in both): 0 
## 
## 
## Records to be addressed include:
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, lastRecord <lgl>, lagPrecip <lgl>,
## #   curPrecip <lgl>
## 
## Start/end times deleted:
## POSIXct of length 0
## 
## Begin times added
## POSIXct of length 0
## 
## End times added:
## POSIXct of length 0
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8808    0
##       TRUE      0    2
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Traverse City, MI (2016) 
## 
## Regex search code is: ((?<!VC)TS[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 5
## Issues where state change does not add to the total: 6
## Total issues (may be less than sum due to same datetime in both): 6 
## 
## 
## Records to be addressed include:
## # A tibble: 8 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-03-16 12:53:00 V1    N53   N       53     2016-03-16 12:53:00 FALSE     
## 2 2016-06-11 02:53:00 V1    E21   E       21     2016-06-11 02:21:00 FALSE     
## 3 2016-06-11 02:53:00 V2    E31   E       31     2016-06-11 02:31:00 FALSE     
## 4 2016-07-14 05:53:00 V1    E35   E       35     2016-07-14 05:35:00 FALSE     
## 5 2016-07-14 05:53:00 V2    B47   B       47     2016-07-14 05:47:00 FALSE     
## 6 2016-08-17 07:53:00 V1    E40   E       40     2016-08-17 07:40:00 FALSE     
## 7 2016-08-24 11:53:00 V1    E43   E       43     2016-08-24 11:43:00 FALSE     
## 8 2016-09-26 02:53:00 V1    E52   E       52     2016-09-26 02:52:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## [1] "2016-06-11 02:31:00 UTC" "2016-07-14 05:35:00 UTC"
## [3] "2016-08-17 07:40:00 UTC" "2016-08-24 11:43:00 UTC"
## [5] "2016-09-26 02:52:00 UTC"
## 
## Begin times added
## POSIXct of length 0
## 
## End times added:
## [1] "2016-03-16 12:53:00 UTC"
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8781    0
##       TRUE      0   33
## 
## Full matches between METAR observations and intervals

The thunder data parsed successfully, with all intervals of the proper length. Notably, there were proportionally more corrections to the begin and end times for thunder than for rain or snow, suggesting that the TSBnnEnn groups may be recorded less precisely than the RABnnEnn and SNBnnEnn groups.
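The `(?<!VC)` negative lookbehind in the search pattern reported above is what keeps vicinity thunderstorms (VCTS) out of the matches. A minimal check of the mechanics, using made-up remark fragments rather than full METARs:

```r
# (?<!VC) matches TS begin/end groups only when not preceded by "VC",
# so vicinity thunderstorms (VCTS) are skipped; perl=TRUE enables lookbehind
hit <- grepl("(?<!VC)TSB\\d+", c("TSB15", "VCTSB15"), perl=TRUE)
hit  # TRUE for the true thunderstorm begin, FALSE for the vicinity report
```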

Next steps are to create and plot metrics by precipitation type across locales.

There are several areas of interest for precipitation:

  • What were the monthly totals of precipitation time by locale?
  • How many precipitation events occurred in each month, by locale?
  • How much total precipitation time was observed in each locale over the year?
  • How many total precipitation events were observed in each locale over the year?

The first function takes a list and returns total precipitation length and total precipitation events by locale and month (overall totals can be summed from this):

# Extract precipitation information (total hours, number of events) by locale and month
precipByLocale <- function(lst, 
                           listItem="beginEndInterval", 
                           subName=-4, 
                           intervalVar="precipInts", 
                           minTime=ymd("2016-01-01"),
                           maxTime=ymd("2016-12-31"), 
                           mapper=cityNameMapper
                           ) {
    
    # FUNCTION ARGUMENTS:
    # lst - the list containing the output for the specified precipitation type
    # listItem - the name of the item to extract from each element of the list
    # subName - the second argument for str_sub() for converting list item name to original file name
    # intervalVar - name of the interval variable in listItem
    # minTime - ignore any interval with a start time before this date
    # maxTime - ignore any interval with a start time after this date
    # mapper - named vector for mapping source to descriptive locale
    
    # Get the names of the objects in the list
    beNames <- names(lst) %>% str_sub(1, subName)
    
    # Get the interval data by locale and apply the original file names
    beData <- sapply(lst, FUN="[", listItem)
    names(beData) <- beNames
    

    # Get the total precipitation length and total precipitation events by month
    # Exclude any interval that begins before minTime or after maxTime
    # Count the precipitation as being in the month when it begins
    
    # Helper function for extraction of total precipitation by month
    getMonthly <- function(x, y=intervalVar, minT=minTime, maxT=maxTime) {
        x %>%
            mutate(intStart=lubridate::int_start(get(y)), 
                   intHours=lubridate::time_length(get(y))/3600, 
                   month=factor(month.abb[lubridate::month(intStart)], levels=month.abb)
                   ) %>%
            filter(intStart >= minT, intStart <= maxT) %>%
            group_by(month) %>%
            summarize(hours=sum(intHours), events=n()) %>%
            right_join(tibble::tibble(month=factor(month.abb, levels=month.abb)), by="month") %>%
            mutate(hours=ifelse(is.na(hours), 0, hours), 
                   events=ifelse(is.na(events), 0, events)
                   ) %>%
            select(month, hours, events) %>%
            pivot_longer(-month, names_to="variable", values_to="value")
    }
    
    # Extract total hours and events by month
    monPrecip <- purrr::map_dfr(beData, .f=getMonthly, .id="source")
    
    # Pivot wider so the data are unique by source-month, with hours and events as columns
    # Add locale name from mapper
    monPrecip %>%
        pivot_wider(id_cols=c(source, month), names_from="variable", values_from="value") %>%
        mutate(locale=mapper[source]) %>%
        select(source, locale, everything())
    
}
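The default `subName=-4` leans on `stringr::str_sub()` treating negative positions as offsets from the end of the string, which is how a suffix such as `_RA` is trimmed from the list names. A quick illustration (the name below is hypothetical, following the `paste0(fileNames, "_RA")` naming convention used later):

```r
library(stringr)

# Negative end positions count back from the last character, so 1 to -4
# keeps everything except the final three characters (the "_RA" suffix)
nm <- str_sub("kdtw_2016_RA", 1, -4)
nm  # "kdtw_2016"
```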

The function can then be run on thunderList:

thunderLength <- precipByLocale(thunderList)
thunderLength
## # A tibble: 180 x 5
##    source    locale             month hours events
##    <chr>     <chr>              <fct> <dbl>  <dbl>
##  1 kdtw_2016 Detroit, MI (2016) Jan   0          0
##  2 kdtw_2016 Detroit, MI (2016) Feb   1.48       2
##  3 kdtw_2016 Detroit, MI (2016) Mar   5.93       5
##  4 kdtw_2016 Detroit, MI (2016) Apr   4.05       3
##  5 kdtw_2016 Detroit, MI (2016) May   1.83       2
##  6 kdtw_2016 Detroit, MI (2016) Jun   1.68       2
##  7 kdtw_2016 Detroit, MI (2016) Jul   7.23       6
##  8 kdtw_2016 Detroit, MI (2016) Aug   6.38      10
##  9 kdtw_2016 Detroit, MI (2016) Sep   1.70       3
## 10 kdtw_2016 Detroit, MI (2016) Oct   0.916      2
## # ... with 170 more rows
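As noted above, overall totals can be summed from the monthly table: grouping by locale and summing hours and events gives yearly totals. A sketch with a toy tibble standing in for the `precipByLocale()` output (the values are illustrative, not real data):

```r
library(dplyr)

# Toy stand-in for the source-locale-month table returned by precipByLocale()
monthly <- tibble::tibble(
    locale=c("Detroit, MI (2016)", "Detroit, MI (2016)", "Newark, NJ (2016)"),
    month=c("Jan", "Feb", "Jan"),
    hours=c(1.5, 2.0, 0.5),
    events=c(2, 3, 1)
)

# Yearly totals by locale: sum the monthly hours and events
yearly <- monthly %>%
    group_by(locale) %>%
    summarize(hours=sum(hours), events=sum(events))
yearly
```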

A function can then be written to create four types of plots:

  • Total precipitation length by locale
  • Total precipitation events by locale
  • Precipitation length by month, faceted by locale
  • Precipitation events by month, faceted by locale

plotPrecipitation <- function(tbl, 
                              pTypeName=""
                              ) {
    
    # FUNCTION ARGUMENTS:
    # tbl: the tibble containing the processed data
    # pTypeName: the type of precipitation as a character
    
    # Plot 1: Total precipitation length by locale
    p1 <- tbl %>%
        group_by(locale) %>%
        summarize(hours=sum(hours)) %>%
        ggplot(aes(x=fct_reorder(locale, hours), y=hours)) +
        geom_col(fill="lightblue") +
        geom_text(aes(label=round(hours), y=hours/2)) + 
        coord_flip() + 
        labs(x="", 
             y="Hours", 
             title=paste0("Total hours of ", stringr::str_to_lower(pTypeName), " by locale")
             )
    print(p1)
    
    
    # Plot 2: Total precipitation events by locale
    p2 <- tbl %>%
        group_by(locale) %>%
        summarize(events=sum(events)) %>%
        ggplot(aes(x=fct_reorder(locale, events), y=events)) +
        geom_col(fill="lightblue") +
        geom_text(aes(label=round(events), y=events/2)) + 
        coord_flip() + 
        labs(x="", 
             y="# Unique Events", 
             title=paste0("Total unique events of ", stringr::str_to_lower(pTypeName), " by locale"), 
             caption="An event is defined by a begin and end and may be as short as a minute\nor with as little as a minute gap to the next event"
             )
    print(p2)
    
    
    # Plot 3: Precipitation length by month, faceted by locale
    mm <- tbl %>%
        group_by(month) %>%
        summarize(hours=mean(hours))
    p3 <- tbl %>%
        ggplot(aes(x=month, y=hours)) +
        geom_col(fill="lightblue") +
        labs(y="Hours", 
             x="", 
             title=paste0("Total hours of ", stringr::str_to_lower(pTypeName), " by month"), 
             subtitle="Red dots are the monthly average across locales in this plot"
             ) + 
        geom_point(data=mm, color="red") +
        facet_wrap(~ locale) + 
        theme(axis.text.x=element_text(angle=90))
    print(p3)
    
    # Plot 4: Precipitation events by month, faceted by locale
    mm <- tbl %>%
        group_by(month) %>%
        summarize(events=mean(events))
    p4 <- tbl %>%
        ggplot(aes(x=month, y=events)) +
        geom_col(fill="lightblue") +
        labs(y="# Unique Events", 
             x="", 
             title=paste0("Total unique events of ", stringr::str_to_lower(pTypeName), " by month"), 
             subtitle="Red dots are the monthly average across locales in this plot", 
             caption="An event is defined by a begin and end and may be as short as a minute\nor with as little as a minute gap to the next event"
             ) + 
        geom_point(data=mm, color="red") +
        facet_wrap(~ locale) + 
        theme(axis.text.x=element_text(angle=90))
    print(p4)
    
}

The function is then run for the thunder data:

plotPrecipitation(thunderLength, pTypeName="thunder")

Processing is then run for the rain and snow data:

fileNames <- grep(pattern="2016", names(cityNameMapper), value=TRUE)
rainList <- vector("list", length(fileNames))
names(rainList) <- paste0(fileNames, "_RA")

for (city in fileNames) {
    
    cat("\n\n*** Processing for:", cityNameMapper[city], "\n")
    rainList[[paste0(city, "_RA")]] <- combinePrecipFunctions(get(city), 
                                                              pType="(?<!FZ)RA", 
                                                              errorLengthInconsistency=TRUE
                                                                 )
    
}
## 
## 
## *** Processing for: Detroit, MI (2016) 
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 3
## Issues where state change does not add to the total: 7
## Total issues (may be less than sum due to same datetime in both): 7 
## 
## 
## Records to be addressed include:
## # A tibble: 9 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-02-24 11:53:00 V1    N53   N       53     2016-02-24 11:53:00 FALSE     
## 2 2016-04-01 00:53:00 V1    B01   B       01     2016-04-01 00:01:00 FALSE     
## 3 2016-04-01 00:53:00 V2    E12   E       12     2016-04-01 00:12:00 FALSE     
## 4 2016-04-07 22:53:00 V1    N53   N       53     2016-04-07 22:53:00 FALSE     
## 5 2016-04-07 23:53:00 V1    N53   N       53     2016-04-07 23:53:00 FALSE     
## 6 2016-07-14 00:53:00 V1    E30   E       30     2016-07-14 00:30:00 FALSE     
## 7 2016-08-15 00:53:00 V1    B2355 B       2355   2016-08-14 23:55:00 FALSE     
## 8 2016-08-15 00:53:00 V2    E29   E       29     2016-08-15 00:29:00 FALSE     
## 9 2016-08-16 14:53:00 V1    N53   N       53     2016-08-16 14:53:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## [1] "2016-04-01 00:01:00 UTC" "2016-07-14 00:30:00 UTC"
## [3] "2016-08-14 23:55:00 UTC"
## 
## Begin times added
## [1] "2016-02-24 11:53:00 UTC" "2016-04-07 23:53:00 UTC"
## 
## End times added:
## [1] "2016-04-07 22:53:00 UTC" "2016-08-16 14:53:00 UTC"
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8258    0
##       TRUE      0  560
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Newark, NJ (2016) 
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 2
## Issues where state change does not add to the total: 2
## Total issues (may be less than sum due to same datetime in both): 2 
## 
## 
## Records to be addressed include:
## # A tibble: 2 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-02-24 19:51:00 V1    B1858 B       1858   2016-02-24 18:58:00 FALSE     
## 2 2016-11-20 08:51:00 V1    E0758 E       0758   2016-11-20 07:58:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## [1] "2016-02-24 18:58:00 UTC" "2016-11-20 07:58:00 UTC"
## 
## Begin times added
## POSIXct of length 0
## 
## End times added:
## POSIXct of length 0
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8195    0
##       TRUE      0  626
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Green Bay, WI (2016) 
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 0
## Issues where state change does not add to the total: 1
## Total issues (may be less than sum due to same datetime in both): 1 
## 
## 
## Records to be addressed include:
## # A tibble: 2 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-05-02 17:53:00 V1    E1659 E       1659   2016-05-02 16:59:00 FALSE     
## 2 2016-05-02 17:53:00 V2    B39   B       39     2016-05-02 17:39:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## POSIXct of length 0
## 
## Begin times added
## POSIXct of length 0
## 
## End times added:
## [1] "2016-05-02 17:53:00 UTC"
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8378    0
##       TRUE      0  425
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Grand Rapids, MI (2016) 
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 2
## Issues where state change does not add to the total: 5
## Total issues (may be less than sum due to same datetime in both): 5 
## 
## 
## Records to be addressed include:
## # A tibble: 5 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-04-07 12:53:00 V1    E15   E       15     2016-04-07 12:15:00 FALSE     
## 2 2016-06-26 13:53:00 V1    B46   B       46     2016-06-26 13:46:00 FALSE     
## 3 2016-06-26 14:53:00 V1    N53   N       53     2016-06-26 14:53:00 FALSE     
## 4 2016-07-24 15:53:00 V1    N53   N       53     2016-07-24 15:53:00 FALSE     
## 5 2016-08-12 09:53:00 V1    B0854 B       0854   2016-08-12 08:54:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## [1] "2016-04-07 12:15:00 UTC" "2016-08-12 08:54:00 UTC"
## 
## Begin times added
## [1] "2016-06-26 14:53:00 UTC"
## 
## End times added:
## [1] "2016-06-26 13:53:00 UTC" "2016-07-24 15:53:00 UTC"
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8208    0
##       TRUE      0  604
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Houston, TX (2016) 
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 1 x 9
##   dtime               name  value chgType chgNum chgTime             deltaMETAR
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <drtn>    
## 1 2016-03-23 22:53:00 V1    B2145 B       2145   2016-03-23 21:45:00 4080 secs 
## # ... with 2 more variables: deltaPrev <dbl>, issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 3
## Issues where state change does not add to the total: 7
## Total issues (may be less than sum due to same datetime in both): 7 
## 
## 
## Records to be addressed include:
## # A tibble: 9 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-03-23 22:53:00 V2    E16   E       16     2016-03-23 22:16:00 FALSE     
## 2 2016-06-06 00:53:00 V1    E40   E       40     2016-06-06 00:40:00 FALSE     
## 3 2016-08-14 00:53:00 V1    N53   N       53     2016-08-14 00:53:00 FALSE     
## 4 2016-08-14 22:53:00 V1    N53   N       53     2016-08-14 22:53:00 FALSE     
## 5 2016-08-21 19:53:00 V1    B1857 B       1857   2016-08-21 18:57:00 FALSE     
## 6 2016-08-21 19:53:00 V2    E27   E       27     2016-08-21 19:27:00 FALSE     
## 7 2016-08-21 19:53:00 V3    B45   B       45     2016-08-21 19:45:00 FALSE     
## 8 2016-12-03 18:53:00 V1    E12   E       12     2016-12-03 18:12:00 FALSE     
## 9 2016-12-04 10:53:00 V1    N53   N       53     2016-12-04 10:53:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## [1] "2016-03-23 22:16:00 UTC" "2016-06-06 00:40:00 UTC"
## [3] "2016-12-03 18:12:00 UTC"
## 
## Begin times added
## [1] "2016-08-14 00:53:00 UTC" "2016-08-14 22:53:00 UTC"
## 
## End times added:
## [1] "2016-08-21 19:53:00 UTC" "2016-12-04 10:53:00 UTC"
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8368    0
##       TRUE      0  448
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Indianapolis, IN (2016) 
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 1
## Issues where state change does not add to the total: 5
## Total issues (may be less than sum due to same datetime in both): 5 
## 
## 
## Records to be addressed include:
## # A tibble: 6 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-04-07 23:54:00 V1    N54   N       54     2016-04-07 23:54:00 FALSE     
## 2 2016-05-08 19:54:00 V1    N54   N       54     2016-05-08 19:54:00 FALSE     
## 3 2016-07-18 17:54:00 V1    N54   N       54     2016-07-18 17:54:00 FALSE     
## 4 2016-08-07 19:54:00 V1    N54   N       54     2016-08-07 19:54:00 FALSE     
## 5 2016-09-05 21:54:00 V1    B01   B       01     2016-09-05 21:01:00 FALSE     
## 6 2016-09-05 21:54:00 V2    E38   E       38     2016-09-05 21:38:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## [1] "2016-09-05 21:01:00 UTC"
## 
## Begin times added
## [1] "2016-07-18 17:54:00 UTC"
## 
## End times added:
## [1] "2016-04-07 23:54:00 UTC" "2016-05-08 19:54:00 UTC"
## [3] "2016-08-07 19:54:00 UTC"
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8144    0
##       TRUE      0  623
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Las Vegas, NV (2016) 
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 0
## Issues where state change does not add to the total: 1
## Total issues (may be less than sum due to same datetime in both): 1 
## 
## 
## Records to be addressed include:
## # A tibble: 1 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-04-30 23:56:00 V1    N56   N       56     2016-04-30 23:56:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## POSIXct of length 0
## 
## Begin times added
## [1] "2016-04-30 23:56:00 UTC"
## 
## End times added:
## POSIXct of length 0
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8710    0
##       TRUE      0  108
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Lincoln, NE (2016) 
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 0
## Issues where state change does not add to the total: 3
## Total issues (may be less than sum due to same datetime in both): 3 
## 
## 
## Records to be addressed include:
## # A tibble: 3 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-07-13 18:54:00 V1    N54   N       54     2016-07-13 18:54:00 FALSE     
## 2 2016-08-29 22:54:00 V1    N54   N       54     2016-08-29 22:54:00 FALSE     
## 3 2016-09-16 06:54:00 V1    N54   N       54     2016-09-16 06:54:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## POSIXct of length 0
## 
## Begin times added
## [1] "2016-08-29 22:54:00 UTC" "2016-09-16 06:54:00 UTC"
## 
## End times added:
## [1] "2016-07-13 18:54:00 UTC"
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8500    0
##       TRUE      0  313
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Milwaukee, WI (2016) 
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 0
## Issues where state change does not add to the total: 4
## Total issues (may be less than sum due to same datetime in both): 4 
## 
## 
## Records to be addressed include:
## # A tibble: 4 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-04-10 10:52:00 V1    N52   N       52     2016-04-10 10:52:00 FALSE     
## 2 2016-06-10 03:52:00 V1    N52   N       52     2016-06-10 03:52:00 FALSE     
## 3 2016-08-20 13:52:00 V1    N52   N       52     2016-08-20 13:52:00 FALSE     
## 4 2016-10-07 02:52:00 V1    N52   N       52     2016-10-07 02:52:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## POSIXct of length 0
## 
## Begin times added
## [1] "2016-04-10 10:52:00 UTC" "2016-06-10 03:52:00 UTC"
## [3] "2016-08-20 13:52:00 UTC" "2016-10-07 02:52:00 UTC"
## 
## End times added:
## POSIXct of length 0
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8346    0
##       TRUE      0  462
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Madison, WI (2016) 
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 1
## Issues where state change does not add to the total: 2
## Total issues (may be less than sum due to same datetime in both): 2 
## 
## 
## Records to be addressed include:
## # A tibble: 2 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-09-07 12:53:00 V1    E52   E       52     2016-09-07 12:52:00 FALSE     
## 2 2016-09-16 11:53:00 V1    N53   N       53     2016-09-16 11:53:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## [1] "2016-09-07 12:52:00 UTC"
## 
## Begin times added
## POSIXct of length 0
## 
## End times added:
## [1] "2016-09-16 11:53:00 UTC"
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8347    0
##       TRUE      0  451
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Minneapolis, MN (2016) 
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 1 x 9
##   dtime               name  value chgType chgNum chgTime             deltaMETAR
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <drtn>    
## 1 2016-09-06 08:53:00 V1    B0744 B       0744   2016-09-06 07:44:00 4140 secs 
## # ... with 2 more variables: deltaPrev <dbl>, issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 2
## Issues where state change does not add to the total: 8
## Total issues (may be less than sum due to same datetime in both): 8 
## 
## 
## Records to be addressed include:
## # A tibble: 9 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-04-05 15:53:00 V1    E22   E       22     2016-04-05 15:22:00 FALSE     
## 2 2016-05-29 19:53:00 V1    B31   B       31     2016-05-29 19:31:00 FALSE     
## 3 2016-06-20 04:53:00 V1    B11   B       11     2016-06-20 04:11:00 FALSE     
## 4 2016-09-06 07:53:00 V1    N53   N       53     2016-09-06 07:53:00 FALSE     
## 5 2016-09-06 08:53:00 V2    E01   E       01     2016-09-06 08:01:00 FALSE     
## 6 2016-09-06 08:53:00 V3    B17   B       17     2016-09-06 08:17:00 FALSE     
## 7 2016-10-18 04:53:00 V1    E35   E       35     2016-10-18 04:35:00 FALSE     
## 8 2016-11-22 13:53:00 V1    N53   N       53     2016-11-22 13:53:00 FALSE     
## 9 2016-11-29 07:53:00 V1    B18   B       18     2016-11-29 07:18:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## [1] "2016-10-18 04:35:00 UTC" "2016-11-29 07:18:00 UTC"
## 
## Begin times added
## [1] "2016-04-05 15:53:00 UTC" "2016-09-06 07:53:00 UTC"
## 
## End times added:
## [1] "2016-05-29 19:53:00 UTC" "2016-06-20 04:53:00 UTC"
## [3] "2016-09-06 08:53:00 UTC" "2016-11-22 13:53:00 UTC"
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8390    0
##       TRUE      0  427
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: New Orleans, LA (2016) 
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 1 x 9
##   dtime               name  value chgType chgNum chgTime             deltaMETAR
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <drtn>    
## 1 2017-01-01 09:53:00 V1    E0852 E       0852   2017-01-01 08:52:00 3660 secs 
## # ... with 2 more variables: deltaPrev <dbl>, issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 2
## Issues where state change does not add to the total: 4
## Total issues (may be less than sum due to same datetime in both): 4 
## 
## 
## Records to be addressed include:
## # A tibble: 5 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-03-11 20:53:00 V1    B07   B       07     2016-03-11 20:07:00 FALSE     
## 2 2016-03-27 16:53:00 V1    N53   N       53     2016-03-27 16:53:00 FALSE     
## 3 2016-08-14 23:53:00 V1    B39   B       39     2016-08-14 23:39:00 FALSE     
## 4 2016-12-13 23:53:00 V1    E2258 E       2258   2016-12-13 22:58:00 FALSE     
## 5 2016-12-13 23:53:00 V2    B30   B       30     2016-12-13 23:30:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Adding final end time to cap interval at end of file: 2017-01-01 23:53:01 
## 
## Start/end times deleted:
## [1] "2016-03-11 20:07:00 UTC" "2016-12-13 22:58:00 UTC"
## 
## Begin times added
## POSIXct of length 0
## 
## End times added:
## [1] "2016-03-27 16:53:00 UTC" "2016-08-14 23:53:00 UTC"
## [3] "2017-01-01 23:53:01 UTC"
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8375    0
##       TRUE      0  438
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Chicago, IL (2016) 
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 0
## Issues where state change does not add to the total: 3
## Total issues (may be less than sum due to same datetime in both): 3 
## 
## 
## Records to be addressed include:
## # A tibble: 3 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-03-24 21:51:00 V1    N51   N       51     2016-03-24 21:51:00 FALSE     
## 2 2016-07-13 18:51:00 V1    N51   N       51     2016-07-13 18:51:00 FALSE     
## 3 2016-08-29 19:51:00 V1    N51   N       51     2016-08-29 19:51:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## POSIXct of length 0
## 
## Begin times added
## [1] "2016-03-24 21:51:00 UTC" "2016-08-29 19:51:00 UTC"
## 
## End times added:
## [1] "2016-07-13 18:51:00 UTC"
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8349    0
##       TRUE      0  466
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: San Diego, CA (2016) 
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 0
## Issues where state change does not add to the total: 0
## Total issues (may be less than sum due to same datetime in both): 0 
## 
## 
## Records to be addressed include:
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, lastRecord <lgl>, lagPrecip <lgl>,
## #   curPrecip <lgl>
## 
## Start/end times deleted:
## POSIXct of length 0
## 
## Begin times added
## POSIXct of length 0
## 
## End times added:
## POSIXct of length 0
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8612    0
##       TRUE      0  198
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Traverse City, MI (2016) 
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 3
## Issues where state change does not add to the total: 5
## Total issues (may be less than sum due to same datetime in both): 5 
## 
## 
## Records to be addressed include:
## # A tibble: 7 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-03-28 10:53:00 V1    E0957 E       0957   2016-03-28 09:57:00 FALSE     
## 2 2016-06-06 20:53:00 V1    B39   B       39     2016-06-06 20:39:00 FALSE     
## 3 2016-06-06 20:53:00 V2    E49   E       49     2016-06-06 20:49:00 FALSE     
## 4 2016-06-06 21:53:00 V1    B17   B       17     2016-06-06 21:17:00 FALSE     
## 5 2016-06-06 21:53:00 V2    E38   E       38     2016-06-06 21:38:00 FALSE     
## 6 2016-09-28 19:53:00 V1    N53   N       53     2016-09-28 19:53:00 FALSE     
## 7 2016-12-02 16:53:00 V1    E29   E       29     2016-12-02 16:29:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## [1] "2016-03-28 09:57:00 UTC" "2016-06-06 21:17:00 UTC"
## [3] "2016-12-02 16:29:00 UTC"
## 
## Begin times added
## [1] "2016-06-06 20:53:00 UTC"
## 
## End times added:
## [1] "2016-09-28 19:53:00 UTC"
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8345    0
##       TRUE      0  469
## 
## Full matches between METAR observations and intervals
# Repeat the interval extraction for snow: pType "(?<!BL)SN" matches
# snow (SN) while excluding blowing snow (BLSN) via a negative lookbehind
fileNames <- grep(pattern="2016", names(cityNameMapper), value=TRUE)
snowList <- vector("list", length(fileNames))
names(snowList) <- paste0(fileNames, "_SN")

for (city in fileNames) {
    
    cat("\n\n*** Processing for:", cityNameMapper[city], "\n")
    snowList[[paste0(city, "_SN")]] <- combinePrecipFunctions(get(city), 
                                                              pType="(?<!BL)SN", 
                                                              errorLengthInconsistency=TRUE
                                                              )
    
}
## 
## 
## *** Processing for: Detroit, MI (2016) 
## 
## Regex search code is: ((?<!BL)SN[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 0
## Issues where state change does not add to the total: 2
## Total issues (may be less than sum due to same datetime in both): 2 
## 
## 
## Records to be addressed include:
## # A tibble: 2 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-02-09 10:53:00 V1    N53   N       53     2016-02-09 10:53:00 FALSE     
## 2 2016-02-24 11:53:00 V1    B12   B       12     2016-02-24 11:12:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## POSIXct of length 0
## 
## Begin times added
## [1] "2016-02-09 10:53:00 UTC"
## 
## End times added:
## [1] "2016-02-24 11:53:00 UTC"
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8344    0
##       TRUE      0  474
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Newark, NJ (2016) 
## 
## Regex search code is: ((?<!BL)SN[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 1
## Issues where state change does not add to the total: 3
## Total issues (may be less than sum due to same datetime in both): 3 
## 
## 
## Records to be addressed include:
## # A tibble: 3 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-02-10 14:51:00 V1    N51   N       51     2016-02-10 14:51:00 FALSE     
## 2 2016-11-20 07:51:00 V1    N51   N       51     2016-11-20 07:51:00 FALSE     
## 3 2016-11-20 08:51:00 V1    E0756 E       0756   2016-11-20 07:56:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## [1] "2016-11-20 07:56:00 UTC"
## 
## Begin times added
## POSIXct of length 0
## 
## End times added:
## [1] "2016-02-10 14:51:00 UTC" "2016-11-20 07:51:00 UTC"
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8683    0
##       TRUE      0  138
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Green Bay, WI (2016) 
## 
## Regex search code is: ((?<!BL)SN[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 1
## Issues where state change does not add to the total: 2
## Total issues (may be less than sum due to same datetime in both): 2 
## 
## 
## Records to be addressed include:
## # A tibble: 3 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-12-07 15:53:00 V1    E09   E       09     2016-12-07 15:09:00 FALSE     
## 2 2016-12-07 15:53:00 V2    B38   B       38     2016-12-07 15:38:00 FALSE     
## 3 2016-12-29 17:53:00 V1    N53   N       53     2016-12-29 17:53:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## [1] "2016-12-07 15:09:00 UTC"
## 
## Begin times added
## POSIXct of length 0
## 
## End times added:
## [1] "2016-12-29 17:53:00 UTC"
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8460    0
##       TRUE      0  343
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Grand Rapids, MI (2016) 
## 
## Regex search code is: ((?<!BL)SN[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 0
## Issues where state change does not add to the total: 3
## Total issues (may be less than sum due to same datetime in both): 3 
## 
## 
## Records to be addressed include:
## # A tibble: 3 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-02-04 20:53:00 V1    N53   N       53     2016-02-04 20:53:00 FALSE     
## 2 2016-04-02 16:53:00 V1    N53   N       53     2016-04-02 16:53:00 FALSE     
## 3 2016-04-02 17:53:00 V1    N53   N       53     2016-04-02 17:53:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## POSIXct of length 0
## 
## Begin times added
## [1] "2016-02-04 20:53:00 UTC" "2016-04-02 17:53:00 UTC"
## 
## End times added:
## [1] "2016-04-02 16:53:00 UTC"
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8137    0
##       TRUE      0  675
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Houston, TX (2016) 
## 
## Regex search code is: ((?<!BL)SN[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 0
## Issues where state change does not add to the total: 0
## Total issues (may be less than sum due to same datetime in both): 0 
## 
## 
## Records to be addressed include:
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, lastRecord <lgl>, lagPrecip <lgl>,
## #   curPrecip <lgl>
## 
## Start/end times deleted:
## POSIXct of length 0
## 
## Begin times added
## POSIXct of length 0
## 
## End times added:
## POSIXct of length 0
##            intMETAR
## precipMETAR FALSE
##       FALSE  8816
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Indianapolis, IN (2016) 
## 
## Regex search code is: ((?<!BL)SN[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 0
## Issues where state change does not add to the total: 0
## Total issues (may be less than sum due to same datetime in both): 0 
## 
## 
## Records to be addressed include:
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, lastRecord <lgl>, lagPrecip <lgl>,
## #   curPrecip <lgl>
## 
## Start/end times deleted:
## POSIXct of length 0
## 
## Begin times added
## POSIXct of length 0
## 
## End times added:
## POSIXct of length 0
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8529    0
##       TRUE      0  238
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Las Vegas, NV (2016) 
## 
## Regex search code is: ((?<!BL)SN[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 0
## Issues where state change does not add to the total: 0
## Total issues (may be less than sum due to same datetime in both): 0 
## 
## 
## Records to be addressed include:
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, lastRecord <lgl>, lagPrecip <lgl>,
## #   curPrecip <lgl>
## 
## Start/end times deleted:
## POSIXct of length 0
## 
## Begin times added
## POSIXct of length 0
## 
## End times added:
## POSIXct of length 0
##            intMETAR
## precipMETAR FALSE
##       FALSE  8818
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Lincoln, NE (2016) 
## 
## Regex search code is: ((?<!BL)SN[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 1
## Issues where state change does not add to the total: 1
## Total issues (may be less than sum due to same datetime in both): 1 
## 
## 
## Records to be addressed include:
## # A tibble: 1 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-01-19 12:54:00 V1    B1155 B       1155   2016-01-19 11:55:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## [1] "2016-01-19 11:55:00 UTC"
## 
## Begin times added
## POSIXct of length 0
## 
## End times added:
## POSIXct of length 0
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8719    0
##       TRUE      0   94
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Milwaukee, WI (2016) 
## 
## Regex search code is: ((?<!BL)SN[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 1
## Issues where state change does not add to the total: 1
## Total issues (may be less than sum due to same datetime in both): 1 
## 
## 
## Records to be addressed include:
## # A tibble: 1 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-01-26 01:52:00 V1    E39   E       39     2016-01-26 01:39:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## [1] "2016-01-26 01:39:00 UTC"
## 
## Begin times added
## POSIXct of length 0
## 
## End times added:
## POSIXct of length 0
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8462    0
##       TRUE      0  346
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Madison, WI (2016) 
## 
## Regex search code is: ((?<!BL)SN[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 3
## Issues where state change does not add to the total: 6
## Total issues (may be less than sum due to same datetime in both): 6 
## 
## 
## Records to be addressed include:
## # A tibble: 6 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-01-09 14:53:00 V1    E39   E       39     2016-01-09 14:39:00 FALSE     
## 2 2016-01-12 12:53:00 V1    N53   N       53     2016-01-12 12:53:00 FALSE     
## 3 2016-02-09 04:53:00 V1    N53   N       53     2016-02-09 04:53:00 FALSE     
## 4 2016-02-09 05:53:00 V1    N53   N       53     2016-02-09 05:53:00 FALSE     
## 5 2016-04-08 14:53:00 V1    E36   E       36     2016-04-08 14:36:00 FALSE     
## 6 2016-12-12 05:53:00 V1    E01   E       01     2016-12-12 05:01:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## [1] "2016-01-09 14:39:00 UTC" "2016-04-08 14:36:00 UTC"
## [3] "2016-12-12 05:01:00 UTC"
## 
## Begin times added
## [1] "2016-02-09 05:53:00 UTC"
## 
## End times added:
## [1] "2016-01-12 12:53:00 UTC" "2016-02-09 04:53:00 UTC"
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8435    0
##       TRUE      0  363
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Minneapolis, MN (2016) 
## 
## Regex search code is: ((?<!BL)SN[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 0
## Issues where state change does not add to the total: 4
## Total issues (may be less than sum due to same datetime in both): 4 
## 
## 
## Records to be addressed include:
## # A tibble: 4 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-01-12 00:53:00 V1    N53   N       53     2016-01-12 00:53:00 FALSE     
## 2 2016-12-07 14:53:00 V1    N53   N       53     2016-12-07 14:53:00 FALSE     
## 3 2016-12-07 15:53:00 V1    N53   N       53     2016-12-07 15:53:00 FALSE     
## 4 2016-12-08 22:53:00 V1    N53   N       53     2016-12-08 22:53:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## POSIXct of length 0
## 
## Begin times added
## [1] "2016-01-12 00:53:00 UTC" "2016-12-07 15:53:00 UTC"
## [3] "2016-12-08 22:53:00 UTC"
## 
## End times added:
## [1] "2016-12-07 14:53:00 UTC"
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8438    0
##       TRUE      0  379
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: New Orleans, LA (2016) 
## 
## Regex search code is: ((?<!BL)SN[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 0
## Issues where state change does not add to the total: 0
## Total issues (may be less than sum due to same datetime in both): 0 
## 
## 
## Records to be addressed include:
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, lastRecord <lgl>, lagPrecip <lgl>,
## #   curPrecip <lgl>
## 
## Start/end times deleted:
## POSIXct of length 0
## 
## Begin times added
## POSIXct of length 0
## 
## End times added:
## POSIXct of length 0
##            intMETAR
## precipMETAR FALSE
##       FALSE  8813
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Chicago, IL (2016) 
## 
## Regex search code is: ((?<!BL)SN[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 1
## Issues where state change does not add to the total: 5
## Total issues (may be less than sum due to same datetime in both): 5 
## 
## 
## Records to be addressed include:
## # A tibble: 5 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-02-10 00:51:00 V1    N51   N       51     2016-02-10 00:51:00 FALSE     
## 2 2016-03-24 21:51:00 V1    N51   N       51     2016-03-24 21:51:00 FALSE     
## 3 2016-04-08 22:51:00 V1    N51   N       51     2016-04-08 22:51:00 FALSE     
## 4 2016-04-09 04:51:00 V1    E46   E       46     2016-04-09 04:46:00 FALSE     
## 5 2016-12-24 01:51:00 V1    N51   N       51     2016-12-24 01:51:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## [1] "2016-04-09 04:46:00 UTC"
## 
## Begin times added
## [1] "2016-02-10 00:51:00 UTC" "2016-03-24 21:51:00 UTC"
## [3] "2016-04-08 22:51:00 UTC" "2016-12-24 01:51:00 UTC"
## 
## End times added:
## POSIXct of length 0
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8532    0
##       TRUE      0  283
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: San Diego, CA (2016) 
## 
## Regex search code is: ((?<!BL)SN[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 0
## Issues where state change does not add to the total: 0
## Total issues (may be less than sum due to same datetime in both): 0 
## 
## 
## Records to be addressed include:
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, lastRecord <lgl>, lagPrecip <lgl>,
## #   curPrecip <lgl>
## 
## Start/end times deleted:
## POSIXct of length 0
## 
## Begin times added
## POSIXct of length 0
## 
## End times added:
## POSIXct of length 0
##            intMETAR
## precipMETAR FALSE
##       FALSE  8810
## 
## Full matches between METAR observations and intervals
## 
## 
## *** Processing for: Traverse City, MI (2016) 
## 
## Regex search code is: ((?<!BL)SN[B|E]\d+[0-9BE]*) 
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 0 x 9
## # ... with 9 variables: dtime <dttm>, name <chr>, value <chr>, chgType <chr>,
## #   chgNum <chr>, chgTime <dttm>, deltaMETAR <drtn>, deltaPrev <dbl>,
## #   issueCons <lgl>
## 
## 
## Issues with begin when precipitation or end when no precipitation: 1
## Issues where state change does not add to the total: 2
## Total issues (may be less than sum due to same datetime in both): 2 
## 
## 
## Records to be addressed include:
## # A tibble: 2 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-12-02 16:53:00 V1    E40   E       40     2016-12-02 16:40:00 FALSE     
## 2 2016-12-18 06:53:00 V1    N53   N       53     2016-12-18 06:53:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## [1] "2016-12-02 16:40:00 UTC"
## 
## Begin times added
## POSIXct of length 0
## 
## End times added:
## [1] "2016-12-18 06:53:00 UTC"
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  7828    0
##       TRUE      0  986
## 
## Full matches between METAR observations and intervals

All precipitation intervals extracted are consistent with the precipitation noted in the METAR.

Plots can then be created for the rain data and snow data:

rainLength <- precipByLocale(rainList)
plotPrecipitation(rainLength, pTypeName="rain")

snowLength <- precipByLocale(snowList)
## Warning: Factor `month` contains implicit NA, consider using
## `forcats::fct_explicit_na`

## Warning: Factor `month` contains implicit NA, consider using
## `forcats::fct_explicit_na`

## Warning: Factor `month` contains implicit NA, consider using
## `forcats::fct_explicit_na`

## Warning: Factor `month` contains implicit NA, consider using
## `forcats::fct_explicit_na`
plotPrecipitation(snowLength, pTypeName="snow")

Findings from the plots include:

RAIN:

  • Las Vegas experiences the fewest hours of rain, as expected for a desert city, while San Diego also experiences well-below-average hours of rain
  • Average hours of rainfall appear seasonal, with a peak in spring and then a smaller peak in fall
  • Note that this analysis counts hours, not inches, so an hour of heavy rain associated with summer convection is treated the same as an hour of very light rain during the change of seasons

SNOW:

  • Houston, Las Vegas, New Orleans, and San Diego experienced no snow in 2016
  • Traverse City and Grand Rapids, both in the lake effect belt, experienced the most hours of snow in 2016
  • Snow is highly seasonal, occurring in the cold season as expected

THUNDER:

  • New Orleans and Houston experienced the most hours of thunder, as expected for hot-humid cities near the Gulf of Mexico
  • San Diego almost never experiences thunder, as expected for a city on the Pacific coast
  • Thunder is seasonal, occurring more often in the warm season as expected

Daily Summaries

METARs periodically contain aggregated summaries of the day’s weather. Two of the areas summarized are temperature and precipitation:

  • RMK.* 4dddddddd - the first dddd is the daily high temperature and the second dddd is the daily low temperature (in each group the leading d is 1 for a negative value, the trailing d is the tenths digit, and the report is in Celsius)
  • RMK.* Pdddd - the amount of liquid precipitation in the past hour, in hundredths of inches
  • RMK.* 6dddd - the amount of liquid precipitation in the past 3/6 hours, in hundredths of inches
  • RMK.* 7dddd - the amount of liquid precipitation in the past 24 hours, in hundredths of inches
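
These encodings can be checked with a small worked example (the remark string below is hypothetical, not taken from the data):

```r
library(stringr)

# Hypothetical remark containing a 4-group ("401220067") and a P-group ("P0018")
rmk <- "RMK AO2 SLP147 P0018 401220067"

# 4-group: high then low, in tenths of a degree Celsius; a leading 1 in
# either half means the value is negative
grp4 <- str_extract(rmk, "(?<= 4)\\d{8}")
hiC <- ifelse(str_sub(grp4, 1, 1) == "1", -1, 1) * as.numeric(str_sub(grp4, 2, 4)) / 10
loC <- ifelse(str_sub(grp4, 5, 5) == "1", -1, 1) * as.numeric(str_sub(grp4, 6, 8)) / 10

# P-group: hundredths of inches of liquid precipitation in the past hour
p1 <- as.numeric(str_extract(rmk, "(?<=P)\\d{4}")) / 100

c(hiC = hiC, loC = loC, p1 = p1)  # 12.2, 6.7, 0.18
```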

A function is written to extract daily high and low temperatures:

# Helper function to convert a 4-digit string to a temperature
convertTemp <- function(x, convF=TRUE) {
    
    # Convert to numeric
    temp <- ifelse(str_sub(x, 1, 1)==1, -1, 1) * as.numeric(str_sub(x, 2, 4)) / 10
    
    # Convert to Fahrenheit if requested
    if (convF) {
        temp <- round(32 + temp * 9/5, 0)
    }
    
    temp
    
}
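
A quick spot check of the conversion (convertTemp is restated so the snippet runs on its own; the 4-digit inputs are hypothetical):

```r
library(stringr)

# Restated from above so this snippet is self-contained
convertTemp <- function(x, convF=TRUE) {
    temp <- ifelse(str_sub(x, 1, 1) == "1", -1, 1) * as.numeric(str_sub(x, 2, 4)) / 10
    if (convF) {
        temp <- round(32 + temp * 9/5, 0)
    }
    temp
}

convertTemp("0217")                  # 21.7 C -> 71 F
convertTemp("1006", convF=FALSE)     # leading 1 flags a negative value: -0.6 C
```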

# Extract the daily high and low temperature (convert to rounded Fahrenheit)
getDailyHighLow <- function(df) {

    # If a character has been passed, convert to the associated object
    if (is.character(df)) {
        df <- get(df)
    }
    
    df %>%
        select(dtime, origMETAR) %>%
        mutate(tempData=str_extract(origMETAR, pattern="RMK.* 4\\d{8}"), 
               tempFHi=convertTemp(str_sub(tempData, -8, -5)), 
               tempFLo=convertTemp(str_sub(tempData, -4, -1))
               ) %>%
        select(-origMETAR)
    
}

The function can then be run for all the 2016 data:

# Extracted high-low temperatures
all_2016_hilo <- map_dfr(.x=fileNames, .f=getDailyHighLow, .id="source") %>%
    mutate(source=fileNames[as.integer(source)], locale=cityNameMapper[source]) %>%
    select(source, locale, everything())
all_2016_hilo
## # A tibble: 132,143 x 6
##    source   locale      dtime               tempData             tempFHi tempFLo
##    <chr>    <chr>       <dttm>              <chr>                  <dbl>   <dbl>
##  1 kdtw_20~ Detroit, M~ 2015-12-31 00:53:00 <NA>                      NA      NA
##  2 kdtw_20~ Detroit, M~ 2015-12-31 01:53:00 <NA>                      NA      NA
##  3 kdtw_20~ Detroit, M~ 2015-12-31 02:53:00 <NA>                      NA      NA
##  4 kdtw_20~ Detroit, M~ 2015-12-31 03:53:00 <NA>                      NA      NA
##  5 kdtw_20~ Detroit, M~ 2015-12-31 04:53:00 RMK AO2 SLP224 T001~      37      32
##  6 kdtw_20~ Detroit, M~ 2015-12-31 05:53:00 <NA>                      NA      NA
##  7 kdtw_20~ Detroit, M~ 2015-12-31 06:53:00 <NA>                      NA      NA
##  8 kdtw_20~ Detroit, M~ 2015-12-31 07:53:00 <NA>                      NA      NA
##  9 kdtw_20~ Detroit, M~ 2015-12-31 08:53:00 <NA>                      NA      NA
## 10 kdtw_20~ Detroit, M~ 2015-12-31 09:53:00 <NA>                      NA      NA
## # ... with 132,133 more rows
# Counts of !is.na by hour
all_2016_hilo %>%
    mutate(hour=lubridate::hour(lubridate::round_date(dtime, unit="hours")), 
           notNA=!is.na(tempFHi)
           ) %>%
    ggplot(aes(x=hour, fill=notNA)) + 
    geom_bar(position="stack") + 
    scale_fill_discrete("Hi-Lo Data") + 
    labs(y="Number of Daily Observations", x="METAR Hour", 
         title="Daily High-Low temperatures are captured based on time zone"
         ) + 
    facet_wrap(~locale)

# Box plots of high and low temperatures by locale by month
all_2016_hilo %>%
    mutate(month=lubridate::month(dtime)) %>%
    filter(!is.na(tempData)) %>%
    ggplot(aes(x=factor(month.abb[month], levels=month.abb))) + 
    geom_boxplot(aes(y=tempFHi), fill="pink") + 
    labs(y="Temperature (F)", x="", 
         title="Daily High Temperatures by Month by Locale"
         ) + 
    facet_wrap(~locale) + 
    theme(axis.text.x=element_text(angle=90))

all_2016_hilo %>%
    mutate(month=lubridate::month(dtime)) %>%
    filter(!is.na(tempData)) %>%
    ggplot(aes(x=factor(month.abb[month], levels=month.abb))) + 
    geom_boxplot(aes(y=tempFLo), fill="lightblue") + 
    labs(y="Temperature (F)", x="", 
         title="Daily Low Temperatures by Month by Locale"
         ) + 
    facet_wrap(~locale) + 
    theme(axis.text.x=element_text(angle=90))

The high-low temperatures appear to be captured almost always at 6Z in the 8 Central time zone locales, almost always at 5Z in the 5 Eastern time zone locales, and almost always at 8Z in the 2 Pacific time zone locales.

Precipitation data can also be extracted by function:

# Helper function to convert 5-character precipitation to inches
convertPrecip <- function(x, begPos=-4, endPos=-1) {
    
    # Convert to numeric and divide by 100 to get inches
    as.numeric(str_sub(x, begPos, endPos)) / 100

}
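
As a quick spot check, the helper can be applied to hypothetical remark fragments (convertPrecip is restated so the snippet runs on its own):

```r
library(stringr)

# Restated from above so this snippet is self-contained
convertPrecip <- function(x, begPos=-4, endPos=-1) {
    # Take the last four digits and divide by 100 to get inches
    as.numeric(str_sub(x, begPos, endPos)) / 100
}

convertPrecip("RMK AO2 SLP150 P0021")  # 0.21 inches in the past hour
convertPrecip("RMK AO2 SLP173 60019")  # 0.19 inches in the past 3/6 hours
```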

# Extract precipitation amounts in inches
getPrecipAmount <- function(df) {
    
    # If a character has been passed, convert to the associated object
    if (is.character(df)) {
        df <- get(df)
    }
    
    df %>%
        select(dtime, origMETAR) %>%
        mutate(p1Inches=convertPrecip(str_extract(origMETAR, pattern="RMK.* P\\d{4}")), 
               p36Inches=convertPrecip(str_extract(origMETAR, pattern="RMK.* 6\\d{4}")), 
               p24Inches=convertPrecip(str_extract(origMETAR, pattern="RMK.* 7\\d{4}"))
               ) %>%
        select(-origMETAR)
    
}

The function can then be run for all the 2016 data:

# Extracted precipitation amounts
all_2016_pin <- map_dfr(.x=fileNames, .f=getPrecipAmount, .id="source") %>%
    mutate(source=fileNames[as.integer(source)], locale=cityNameMapper[source]) %>%
    select(source, locale, everything())
all_2016_pin
## # A tibble: 132,143 x 6
##    source    locale             dtime               p1Inches p36Inches p24Inches
##    <chr>     <chr>              <dttm>                 <dbl>     <dbl>     <dbl>
##  1 kdtw_2016 Detroit, MI (2016) 2015-12-31 00:53:00       NA        NA        NA
##  2 kdtw_2016 Detroit, MI (2016) 2015-12-31 01:53:00       NA        NA        NA
##  3 kdtw_2016 Detroit, MI (2016) 2015-12-31 02:53:00       NA        NA        NA
##  4 kdtw_2016 Detroit, MI (2016) 2015-12-31 03:53:00       NA        NA        NA
##  5 kdtw_2016 Detroit, MI (2016) 2015-12-31 04:53:00       NA        NA        NA
##  6 kdtw_2016 Detroit, MI (2016) 2015-12-31 05:53:00       NA        NA        NA
##  7 kdtw_2016 Detroit, MI (2016) 2015-12-31 06:53:00       NA        NA        NA
##  8 kdtw_2016 Detroit, MI (2016) 2015-12-31 07:53:00       NA        NA        NA
##  9 kdtw_2016 Detroit, MI (2016) 2015-12-31 08:53:00       NA        NA        NA
## 10 kdtw_2016 Detroit, MI (2016) 2015-12-31 09:53:00       NA        NA        NA
## # ... with 132,133 more rows
# Map of variable to chart name
mapChartVar <- c(p1Inches="1-hr Precip", p36Inches="3-hr or 6-hr Precip", p24Inches="24-hr Precip")

for (plotVar in c("p1Inches", "p36Inches", "p24Inches")) {
    
    # Counts of !is.na by hour
    p1 <- all_2016_pin %>%
        mutate(hour=factor(lubridate::hour(lubridate::round_date(dtime, unit="hours"))), 
               notNA=!is.na(get(plotVar))
               ) %>%
        ggplot(aes(x=hour, fill=notNA)) + 
        geom_bar(position="stack") + 
        scale_fill_discrete(paste0(mapChartVar[plotVar], " Data")) + 
        labs(y="Number of Days with Observations at Hour", x="METAR Hour", 
             title=paste0("Capture for ", mapChartVar[plotVar])
             ) + 
        facet_wrap(~locale) + 
        scale_x_discrete(breaks=seq(0, 23, by=3))
    print(p1)
    
}

Hourly precipitation data is recorded as needed in any hour (if precipitation occurred in the past hour).

3/6-hourly precipitation is recorded as needed at any of the 0Z-3Z-6Z-9Z-12Z-15Z-18Z-21Z synoptic times (if precipitation occurred in the past 3/6 hours).

24-hourly precipitation is recorded as needed at 12Z (if precipitation occurred in the past 24 hours).
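
By analogy with the hour gating applied to the 3/6-hour totals in the consistency check, a 12Z-only gate for the 24-hour totals might look like this sketch on toy rows (the dates and values here are hypothetical, not drawn from the files):

```r
library(dplyr)
library(lubridate)

# Toy rows (hypothetical): only 24-hour reports near 12Z should count
toy <- tibble(
    dtime = ymd_hms(c("2016-04-18 11:53:00", "2016-04-19 11:53:00",
                      "2016-04-19 17:53:00")),
    p24Inches = c(NA, 0.32, 0.05)
)

toy %>%
    mutate(p24ok = hour(round_date(dtime, unit="hours")) == 12,
           p24Inches = ifelse(is.na(p24Inches) | !p24ok, 0, p24Inches)) %>%
    summarize(`24-Hour` = sum(p24Inches))   # keeps only the 0.32 at 12Z
```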

The precipitation sums can also be checked for consistency:

# Sum precipitation by year for p1, p36 (using only 0Z-6Z-12Z-18Z), p24
all_2016_pin %>%
    mutate(p1Inches=ifelse(is.na(p1Inches), 0, p1Inches),
           p36ok=lubridate::hour(lubridate::round_date(dtime, unit="hours")) %in% c(0, 6, 12, 18),
           p36Inches=ifelse(is.na(p36Inches) | !p36ok, 0, p36Inches),
           p24Inches=ifelse(is.na(p24Inches), 0, p24Inches)
           ) %>%
    group_by(locale) %>%
    summarize_if(is.numeric, sum) %>%
    rename(`1-Hour`=p1Inches, `6-Hour`=p36Inches, `24-Hour`=p24Inches) %>%
    pivot_longer(-locale, names_to="Unit", values_to="Inches") %>%
    ggplot(aes(x=Unit, y=Inches)) + 
    geom_col(fill="lightblue") + 
    geom_text(aes(label=round(Inches, 1), y=Inches+10)) +
    labs(title="Annual precipitation when summing by various units", x="") +
    facet_wrap(~locale)

There are some minor differences when summing the precipitation by different units. The biggest outlier is the Houston 1-hour total, which is ~10 inches greater than the Houston 6-hour or 24-hour totals:

# Explore the Houston data by month
all_2016_pin %>% 
    filter(source=="kiah_2016") %>% 
    group_by(month=lubridate::month(dtime)) %>% 
    summarize(p1=sum(p1Inches, na.rm=TRUE), p24=sum(p24Inches, na.rm=TRUE))
## # A tibble: 12 x 3
##    month    p1   p24
##    <dbl> <dbl> <dbl>
##  1     1  2.16  2.46
##  2     2  1.95  1.95
##  3     3  3.25  3.23
##  4     4 12.1   3.28
##  5     5  7.2   7.74
##  6     6 13.0  12.9 
##  7     7  1.09  1.09
##  8     8 10.4  10.4 
##  9     9  1.7   1.71
## 10    10  0.14  0.14
## 11    11  1.99  1.99
## 12    12  3.41  2.67
# Explore the April data by day
all_2016_pin %>% 
    filter(source=="kiah_2016", lubridate::month(dtime)==4) %>% 
    group_by(day=lubridate::day(dtime)) %>% 
    summarize(p1=sum(p1Inches, na.rm=TRUE), p24=sum(p24Inches, na.rm=TRUE)) %>%
    filter(p1>0 | p24 > 0)
## # A tibble: 15 x 3
##      day    p1   p24
##    <int> <dbl> <dbl>
##  1     1  0.05  0.07
##  2    11  0.03  0.03
##  3    12  0.01  0   
##  4    13  0.63  0.64
##  5    17  0.08  0.05
##  6    18  8.29  0   
##  7    19  0.03  0   
##  8    20  0.71  0.02
##  9    21  0.77  0.71
## 10    22  0     0.77
## 11    24  0.32  0   
## 12    25  0     0.32
## 13    27  0.6   0.6 
## 14    29  0.01  0.01
## 15    30  0.6   0.06
# Explore the April 18 data
kiah_2016 %>%
    filter(lubridate::date(dtime) %in% as.Date(c("2016-04-18", "2016-04-19"))) %>%
    pull(origMETAR)
##  [1] "KIAH 180053Z 11015KT 7SM OVC012 22/21 A2997 RMK AO2 RAE50 SLP147 P0001 T02170211"                                                                                                                                    
##  [2] "KIAH 180153Z 11018G22KT 7SM OVC011 22/21 A2998 RMK AO2 SLP152 FRQ LTGIC CB DSNT SW-W MOV NW T02170211"                                                                                                               
##  [3] "KIAH 180253Z 12025G38KT 7SM -RA OVC012 22/21 A2997 RMK AO2 PK WND 11038/0246 RAB12 SLP149 FRQ LTGIC CB DSNT SW-NW MOV NW P0000 60001 T02170206 50000"                                                                
##  [4] "KIAH 180353Z 11026G33KT 7SM -TSRA OVC014CB 22/21 A2998 RMK AO2 PK WND 11033/0353 TSB21RAE0255B45 SLP151 FRQ LTGICCG TS SW-NW MOV N P0000 T02220206"                                                                  
##  [5] "KIAH 180453Z 11024G36KT 7SM TS BKN014 OVC020CB 22/21 A2997 RMK AO2 PK WND 12036/0447 RAE0356 SLP149 CONS LTGICCG TS SW-NW MOV N P0000 T02220206"                                                                     
##  [6] "KIAH 180553Z 29010KT 6SM +TSRA BR FEW009 BKN014 OVC020CB 19/18 A3004 RMK AO2 PK WND 14037/0517 RAB14 SLP173 CONS LTGICCG TS S-W-N MOV NE P0018 60019 T01890183 10222 20189 402440189 53024"                          
##  [7] "KIAH 180653Z 07007KT 1SM R27/3000VP6000FT +TSRA BR FEW004 BKN011CB OVC020 18/18 A3002 RMK AO2 SLP166 CONS LTGICCG TS W-NE MOV NE P0391 T01780178"                                                                    
##  [8] "KIAH 180753Z 09010KT 8SM -RA OVC009 20/20 A2998 RMK AO2 TSE40 PRESFR SLP150 OCNL LTGICCC DSNT W NE TS MOVD NE CB DSNT W NE MOV NE P0021 T02000200"                                                                   
##  [9] "KIAH 180853Z 13022G31KT 7SM -TSRA BKN009CB OVC017 21/21 A2993 RMK AO2 PK WND 13031/0847 TSB14 SLP136 CONS LTGICCCCG TS VC S-W-N MOV NE P0002 60414 T02110211 58037"                                                  
## [10] "KIAH 180953Z 31020G24KT 1/16SM R27/3000VP6000FT +TSRA FG OVC005CB 18/18 A3000 RMK AO2 PK WND 13030/0854 SLP157 CONS LTGICCG TS OHD MOV NE P0319 T01830183 $"                                                         
## [11] "KIAH 181053Z 02010KT 7SM -TSRA FEW010 SCT032CB OVC050 18/18 A3002 RMK PK WND 36029/1001 WSHFT 0953 OCNL LTGICCC TS N-SE-S MOV NE SLPNO PNO"                                                                          
## [12] "KIAH 181153Z 03005KT 7SM -TSRA FEW010 BKN030CB OVC065 19/18 A3005 RMK FRQ LTGICCC TS VC SE-SW MOV E"                                                                                                                 
## [13] "KIAH 181253Z 11006KT 3SM -TSRA BR FEW012 OVC027CB 19/18 A3006 RMK FRQ LTGICCC TS VC SE-SW MOV E"                                                                                                                     
## [14] "KIAH 181353Z 00000KT 4SM +TSRA BR FEW006 BKN015CB OVC030 18/ A3010 RMK AO2 SLP192 TS FRQ LTGICCG NE-SE MOVG E T0178 $"                                                                                               
## [15] "KIAH 181453Z 10005KT 4SM RA BR FEW008 SCT020CB OVC080 18/18 A3008 RMK AO2 SLP187 TSE39 TS MOVD E CB ALQDS MOVG E P0034 6//// T01780178 $"                                                                            
## [16] "KIAH 181553Z 13005KT 4SM TSRA BR FEW008 SCT020CB OVC100 18/18 A3012 RMK AO2 TSB11 SLP199 TS FRQ LTGICCCCG ALQDS MOVG E P0009 T01780178 $"                                                                            
## [17] "KIAH 181653Z 10004KT 4SM +TSRA BR BKN005 BKN020CB OVC090 18/18 A3012 RMK AO2 SLP198 TS FRQ LTGICCCCG N-E-S MOVG E P0023 T01780178 $"                                                                                 
## [18] "KIAH 181753Z VRB06KT 7SM -RA SCT005 SCT020CB OVC090 18/18 A3012 RMK AO2 TSE22 SLP200 TS MOVD E CB S-SW MOVG E PCPN FROM 0600Z TO 1753Z 9.87 INCHES PCPN FROM 0500Z TO 1753Z 10.58 INCHES P0007 6//// T01830183 10211"
## [19] "KIAH 181853Z 13009KT 10SM -RA SCT007 SCT020 OVC110 18/18 A3006 RMK AO2 SLP180 CB DSIPTD P0004 T01830183"                                                                                                             
## [20] "KIAH 181953Z 09010KT 10SM -RA BKN009 OVC110 19/19 A3007 RMK AO2 RAE19B40 SLP181 P0000 T01940189"                                                                                                                     
## [21] "KIAH 182053Z 07007KT 10SM OVC011 19/19 A3004 RMK AO2 RAE16 SLP171 P0000 60004 T01940189 56029"                                                                                                                       
## [22] "KIAH 182153Z 13008KT 10SM FEW013 SCT110 BKN250 21/19 A3001 RMK AO2 SLP163 T02060189"                                                                                                                                 
## [23] "KIAH 182253Z 00000KT 10SM FEW018 SCT110 SCT180 BKN250 21/18 A3003 RMK AO2 SLP169 T02110183"                                                                                                                          
## [24] "KIAH 182353Z 03006KT 10SM BKN021 BKN180 BKN250 21/18 A3004 RMK AO2 SLP173 60004 T02060178 10211 20183 53002"                                                                                                         
## [25] "KIAH 190053Z 06003KT 10SM BKN022 OVC140 20/19 A3007 RMK AO2 SLP182 T02000189"                                                                                                                                        
## [26] "KIAH 190153Z 32003KT 9SM -RA BKN022 OVC027 21/19 A3010 RMK AO2 RAB50 SLP193 P0000 T02060189"                                                                                                                         
## [27] "KIAH 190253Z 31003KT 10SM FEW029 SCT036 BKN045 20/19 A3011 RMK AO2 RAE14 SLP195 P0000 60000 T02000189 51023"                                                                                                         
## [28] "KIAH 190353Z 07007KT 10SM -RA FEW045 SCT060 BKN100 BKN130 19/19 A3009 RMK AO2 RAB41 SLP187 P0001 T01890189"                                                                                                          
## [29] "KIAH 190453Z 07006KT 10SM FEW024 SCT050 BKN150 BKN170 19/19 A3005 RMK AO2 RAE26 PRESFR SLP175 P0000 T01890189"                                                                                                       
## [30] "KIAH 190553Z 12004KT 10SM FEW025 SCT100 BKN150 OVC200 19/19 A3005 RMK AO2 SLP174 60001 T01890189 10211 20189 402110178 56021"                                                                                        
## [31] "KIAH 190653Z 00000KT 10SM FEW025 SCT100 OVC170 19/19 A3002 RMK AO2 SLP165 T01890189"                                                                                                                                 
## [32] "KIAH 190753Z 00000KT 10SM FEW007 SCT020 SCT100 OVC170 19/19 A3003 RMK AO2 SLP169 T01890189"                                                                                                                          
## [33] "KIAH 190853Z 00000KT 10SM FEW007 BKN025 OVC250 19/19 A3005 RMK AO2 SLP175 T01890189 53001"                                                                                                                           
## [34] "KIAH 190953Z 36003KT 8SM SCT003 SCT030 BKN250 18/18 A3005 RMK AO2 SLP176 T01830183"                                                                                                                                  
## [35] "KIAH 191053Z 02004KT 5SM BR BKN003 BKN042 BKN250 18/18 A3006 RMK AO2 SLP178 T01830183"                                                                                                                               
## [36] "KIAH 191153Z 05003KT 1/4SM R27/5000VP6000FT FG VV002 18/18 A3006 RMK AO2 SLP180 7//// T01830183 10194 20183 53004"                                                                                                   
## [37] "KIAH 191253Z 07005KT 3SM BR BKN003 BKN040 OVC060 19/19 A3008 RMK AO2 SLP186 T01890189"                                                                                                                               
## [38] "KIAH 191353Z 00000KT 6SM BR SCT003 BKN028 BKN060 OVC130 21/21 A3008 RMK AO2 SLP184 T02060206"                                                                                                                        
## [39] "KIAH 191453Z 02003KT 8SM FEW007 OVC029 22/21 A3007 RMK AO2 SLP182 T02220206 50003"                                                                                                                                   
## [40] "KIAH 191553Z 10004KT 10SM SCT014 BKN035 OVC190 23/21 A3006 RMK AO2 SLP180 T02330206"                                                                                                                                 
## [41] "KIAH 191653Z 11004KT 10SM SCT018 BKN039 OVC220 24/20 A3005 RMK AO2 SLP176 T02440200"                                                                                                                                 
## [42] "KIAH 191753Z 07004KT 10SM BKN020TCU BKN040 OVC220 24/21 A3004 RMK AO2 SLP171 TCU NE T02440206 10250 20183 58011"                                                                                                     
## [43] "KIAH 191853Z 13004KT 10SM BKN025 BKN040 OVC220 24/20 A3003 RMK AO2 SLP167 CB DSNT E S SW T02440200"                                                                                                                  
## [44] "KIAH 191953Z 13008KT 10SM BKN030CB BKN050 OVC250 25/20 A3001 RMK AO2 RAB06E27 SLP161 CB E AND S MOVG LTL P0002 T02500200"                                                                                            
## [45] "KIAH 192053Z 14011KT 10SM SCT030 BKN060 BKN250 25/20 A2999 RMK AO2 RAB24E36 SLP154 CB DSIPTD P0000 60002 T02500200 58017"                                                                                            
## [46] "KIAH 192153Z 13011G17KT 10SM SCT028 SCT045 BKN250 26/21 A2996 RMK AO2 SLP144 T02560206"                                                                                                                              
## [47] "KIAH 192253Z 12013KT 10SM FEW025 SCT034 BKN060 BKN250 24/19 A2997 RMK AO2 SLP148 T02390194"                                                                                                                          
## [48] "KIAH 192353Z 13013KT 10SM FEW024 SCT060 SCT080 SCT250 23/19 A2996 RMK AO2 SLP146 CB DSNT W 60002 T02280194 10261 20228 56008"

Houston experienced near-record flooding on April 18, 2016, due to ~10 inches of rain. The 1-hour totals appear to capture this, while the 24-hour total is absent on April 18 and reported as 7//// (bad/missing data) on April 19.

Next steps are to combine all of the precipitation functions, allow for output to be directed to a log file, and consolidate/save all processed EDA for later use.